IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

DECLARATION OF TOD BEDILION, Ph.D. 
UNDER 37 C.F.R. § 1.132

I, TOD BEDILION, Ph.D., declare and state as follows:

1. In April, 1996, I became the first employee of 
Synteni, Inc., where I served as Research Director until its 
acquisition by Incyte Corporation in early 1998. After 
Synteni's acquisition, I continued in the position of Director
of Corporate Development at Incyte until May 11, 2001. I am
currently the Director of Business Development at Genomic
Health, Inc., Redwood City, California and an occasional 
Consultant to Incyte. 

2. Synteni was founded to commercialize expression
microarrays, that is, microarrays in which expressed nucleic acids
(full-length cDNAs, fragments of full-length cDNAs, and expressed
sequence tags (ESTs)) are arrayed on a common support to
permit highly parallel detection and measurement of the
expression of their cognate genes in a biological sample.

3. During my employ at Synteni, virtually all (if 
not all) of my work efforts were directed to the further 
technical development and the commercial exploitation of that 
microarray technology; given the small size of our shop, most 
of us had both technical and commercial responsibilities. The 
customer accounts for which I was personally responsible 
included large pharmaceutical companies, such as SmithKline
Beecham, large biotechnology companies, such as Genentech, and
small research institutes, such as DRAX Inc.

4. From my very first interaction with our
customers, consistently through to Synteni's acquisition by
Incyte, I heard uniform, consistent, and emphatic requests
that more genes be added to the arrays. This was true with
respect to both our original microarrays, based on
customer-provided genes and libraries, and our later, "generic", gene
expression microarrays, based upon the UniGene clone
collection (our so-called "UniGem" arrays). From day 1, the
pressure on us was to print ever more spots on the array. It
was never a question; our customers wanted ever more genes on
the array, each new gene-specific probe providing
incrementally more value to the customer.¹

5. As a commercial enterprise, providing value to
our customers was our major concern. Thus, to increase the
value of our products and services in the marketplace (that is, to
increase our ability to sell our microarrays and microarray
services, their "salability"), our efforts from the very
beginning were devoted to increasing the number of specific
genes whose expression could be detected with our microarrays.

6. Indeed, one of our major competitive advantages
in the marketplace, not just as regards other commercial
suppliers, but also with respect to the innumerable
laboratories and companies that were attempting to spot arrays
in their own "home-brew" facilities, was the number of
distinct gene-specific probes that we provided on our
expression microarrays. Our first 10,000 element UniGem array
put the holy grail of gene expression analysis, the human
whole genome array, within sight for the very first time
(with respect to timing of the UniGEM program, we began project
planning and technology development in mid 1996 and delivered
our first 10,000 element standard content human arrays in the
first months of 1997, as I recall).

¹ I should note that [...] for addition of probes
specific to only those [...] biological function of the
encoded gene product was known, but [...] asking for probes
specific to any and all expressed genes.

7. By the end of 1997, our efforts to provide the 
most comprehensive, and thus most valuable, human gene 
expression microarrays had been sufficiently successful that 
Incyte agreed to acquire Synteni for a reported $80 million.

8. I declare further that all statements made 
herein of my own knowledge are true and that all statements
made on information and belief are believed to be true, and 
further that these statements were made with the knowledge 
that willful false statements and the like so made are 
punishable by fine or imprisonment, or both, under 
Section 1001 of Title 18 of the United States Code and may 
jeopardize the validity of any patent application in which 
this declaration is filed or any patent that issues thereon. 




Tod Bedilion, Ph. D. Date 
The present invention involves methods and compositions for identifying genes which are differentially expressed in a normal healthy
animal and an animal having a selected disease or infection, and methods for diagnosing diseases or infections characterized by the presence 
of those genes, despite the absence of knowledge about the gene or its function. The methods involve the use of a composition suitable 
for use in hybridization which consists of a solid surface on which is immobilized at pre-defined regions thereon a plurality of defined 
oligonucleotide/polynucleotide sequences for hybridization. Each sequence comprises a fragment of an EST isolated from an identified 
DNA library prepared from tissue or cell samples of a healthy animal, an animal with a selected disease or infection, and any combination 
thereof. Differences in hybridization patterns produced through use of this composition and the specified methods enable diagnosis of 
disease based on differential expression of genes of unknown function, and enable the identification of those genes and the proteins encoded 
thereby. 
differentially expressed genes in healthy and diseased subjects

Cross Reference to Related Applications: 
This application is a continuation-in-part application of U.S. Serial No.
08/195,485 filed February 14, 1994, the contents of which are incorporated herein by
reference.

Field of the Invention 

The present invention relates to the use of immobilized
oligonucleotide/polynucleotide or polynucleotide sequences for the identification,
sequencing and characterization of genes which are implicated in disease, infection, 
or development and the use of such identified genes and the proteins encoded thereby 
in diagnosis, prognosis, therapy and drug discovery. 


Background of the Invention 

Identification, sequencing and characterization of genes, especially
human genes, is a major goal of modern scientific research. By identifying genes,
determining their sequences and characterizing their biological function, it is possible
to employ recombinant DNA technology to produce large quantities of valuable "gene
products", e.g., proteins and peptides. Additionally, knowledge of gene sequences
can provide a key to diagnosis, prognosis and treatment of a variety of disease states
in plants and animals which are characterized by inappropriate expression and/or
repression of selected gene(s) or by the influence of external factors, e.g., carcinogens
or teratogens, on gene function. The term disease-associated gene(s) is used herein
in its broadest sense to mean not only genes associated with classical inherited
diseases, but also those associated with genetic predisposition to disease as well as
infectious or pathogenic states resulting from gene expression by infectious agents or
the effect on host cell gene expression by the presence of such a pathogen or its
products. Locating disease-associated genes will permit the development of
diagnostic and prognostic reagents and methods, as well as possible therapeutic
regimens, and the discovery of new drugs for treating or preventing the occurrence of
such diseases.

Methods have been described for the identification of certain novel
gene sequences, referred to as Expressed Sequence Tags (ESTs) [see, e.g., Adams et
al, Science, 252:1651-1656 (1991); and International Patent Application No.
WO93/00353, published January 7, 1993]. Conventionally, an EST is a specific cDNA
polynucleotide sequence, or tag, about 150 to 400 nucleotides in length, derived from
a messenger RNA molecule by reverse transcription, which is a marker for, and
component of, a human gene actually transcribed in vivo. However, as used herein an
EST also refers to a genomic DNA fragment derived from an organism, such as a
microorganism, the DNA of which lacks intron regions.

A variety of techniques have been described for identifying particular
gene sequences on the basis of their gene products. For example, several techniques
are described in the art [see, e.g., International Patent Application No. WO91/07087,
published May 30, 1991]. Additionally, known methods exist for the amplification of
desired sequences [see, e.g., International Patent Application No. WO91/17271,
published November 14, 1991, among others].

However, at present, there exist no established methods for filling the
need in the art for methods and reagents which employ fragments of differentially
expressed genes of known, unknown (or previously unrecognized) function or
consequence to provide diagnostic and therapeutic methods and reagents for diagnosis
and treatment of disease or infection, which conditions are characterized by such
genes and gene products. It should be appreciated that it is the expression differences
that are diagnostic of the altered state (e.g., predisease, disease, pathogenic,
progression or infectious). Such genes associated with the altered state are likely to
be the targets of drug discovery, whether the genes are the cause or the effect of the
condition. Identification of such genes provides insight into which gene expression
needs to be re-altered in order to reestablish the healthy state.

Summary of the Invention 

In one aspect, the invention provides methods for identifying gene(s)
which are differentially expressed, for example, in a normal healthy organism and an
organism having a disease. The method involves producing and comparing
hybridization patterns formed between samples of expressed mRNA or cDNA
polynucleotide sequences obtained from either analogous cells, tissues or organs of a
healthy organism and a diseased organism and a defined set of
oligonucleotide/polynucleotide sequence probes from either a
healthy organism or a diseased organism immobilized on a support. Those defined
oligonucleotide/polynucleotide sequences are representative of the total expressed
genetic component of the cells, tissues, organs or organism as defined by the collection
of partial cDNA sequences (ESTs). The differences between the hybridization
patterns permit identification of those particular EST or gene-specific
oligonucleotide/polynucleotide sequences associated with differential expression, and
the identification of the EST permits identification of the clone from which it was
derived and, using ordinary skill, further cloning and, if desired, sequencing of the
full-length cDNA and genomic counterpart, i.e., gene, from which it was obtained.

In another aspect, the invention provides methods substantially similar 
to those described above, but which permit identification of those gene(s) of a 
pathogen which are expressed in any biological sample of an infected organism based
on comparative hybridization of RNA/cDNA samples derived from a healthy versus 
infected organism, hybridized to an oligonucleotide/polynucleotide set representative 
of the gene coding complement of the pathogen of interest. 

In another aspect, the invention provides methods substantially similar
to those described above, but which permit identification of those EST-specific
oligonucleotide/polynucleotide sequences of host gene(s) which represent genes being
differentially expressed/altered in expression by the disease state or infection and are
expressed in any biological sample of an infected organism based on comparative
hybridization of RNA/cDNA samples derived from a healthy versus infected
organism of interest.

In a further aspect, the methods described above and in detail below
also provide methods for diagnosis of diseases or infections characterized by
differentially expressed genes, the expression of which has been altered as a result of
infection by the pathogen or disease causing agent in question. All identified
differences provide the basis for diagnostic testing, be it the altered expression of
endogenous genes or the patterned expression of the genes of the infecting organism.
Such patterns of altered expression are defined by comparing RNA/cDNA from the
two states hybridized against a panel of oligonucleotide/polynucleotides representing
the expressed gene component of a cell, tissue, organ or organism as defined by its
collection of ESTs.

Yet a further aspect of this invention provides a composition suitable 
for use in hybridization, which comprises a solid surface on which is immobilized at 
pre-defined regions thereon a plurality of defined oligonucleotide/polynucleotide 
sequences for hybridization, each sequence comprising a fragment of an EST isolated
from a cDNA or DNA library prepared from at least one selected tissue or cell
sample of a healthy (i.e., pre-disease state) animal, at least one analogous sample of 
an animal having a disease, at least one analogous sample of an animal infected with a 
pathogen or the pathogen itself, or any combination or multiple combinations thereof. 

An additional aspect of the invention provides an isolated gene
sequence which is differentially expressed in a normal healthy animal and an animal
having a disease, and is identified by the methods above. Similarly, an isolated 
pathogen gene sequence which is expressed in tissue or cell samples of an infected 
animal can be identified by the methods above. 

Yet another aspect of the invention is that it provides not only a means
for a static diagnostic but also a means for carrying out the procedure over
time to measure disease progression as well as monitoring the efficacy of disease
treatment regimes including any toxicological effects thereof.

Another aspect of the invention is an isolated protein produced by
expression of the gene sequences identified above. Such proteins are useful in
therapeutic compositions or diagnostic compositions, or as targets for drug
development.

Other aspects and advantages of the present invention are described 
further in the following detailed description of the preferred embodiments thereof.

Detailed Description of the Invention 

The present invention meets the unfulfilled needs in the art by
providing methods for the identification and use of gene fragments and genes, even
those of unknown full length sequence and unknown function, which are
differentially expressed in a healthy animal and in an animal having a specific disease
or infection by use of ESTs derived from DNA libraries of healthy and/or
diseased/infected animals. Employing the methods of this invention permits the
resulting identification and isolation of such genes by using their corresponding ESTs
and thereby also permits the production of protein products encoded by such genes.
The genes themselves and/or protein products, if desired, may be employed in the
diagnosis or therapy of the disease or infection with which the genes are associated
and in the development of new drugs therefor.

It has been appreciated that one or more differentially identified EST
or gene-specific oligonucleotide/polynucleotides define a pattern of differentially
expressed genes diagnostic of a predisease, disease or infective state. A knowledge of
the specific biological function of the EST is not required, only that the EST
identifies a gene or genes whose altered expression is associated reproducibly with
the predisease, disease or infectious state. The differences permit the identification of
gene products altered in their expression by the disease and represent those products
most likely to be targets of therapeutic intervention. Similarly, the product may be of
the infecting organism itself and also be an effective target of intervention.

I. Definitions.

Several words and phrases used throughout this specification are
defined as follows:

As used herein, the term "gene" refers to the genomic nucleotide
sequence from which a cDNA sequence is derived, which cDNA produces an EST, as
described below. The term gene classically refers to the genomic sequence, which,
upon processing, can produce different cDNAs, e.g., by splicing events. However,
for ease of reading, any full-length counterpart cDNA sequence which gives rise to an
EST will also be referred to by shorthand herein as a "gene".
The term "organism" includes, without limitation, microbes, plants and
animals.

The term "animal" is used in its broadest sense to include all members
of the animal kingdom, including humans. It should be understood, however, that
according to this invention the same species of animal which provides the biological
sample also is the source of the defined immobilized oligonucleotide/polynucleotides
as defined below.

The term "pathogen" is defined herein as any molecule or organism
which is capable of infecting an animal or plant and replicating its nucleic acid
sequences in the cells or tissues of that animal or plant. Such a pathogen is generally
associated with a disease condition in the infected animal or plant. Such pathogens
may include viruses, which replicate intra- or extra-cellularly, or other organisms,
such as bacteria, fungi or parasites, which generally infect tissues or the blood.
Certain pathogens or microorganisms are known to exist in sequential and
distinguishable stages of development, e.g., latent stages, infective stages, and stages
which cause symptomatic diseases. In these different stages, the pathogens are
anticipated to express differentially certain genes and/or turn on or off host cell gene
expression.

As used herein, the term "disease" or "disease state" refers to any
condition which deviates from a normal or standardized healthy state in an organism
of the same species in terms of differential expression of the organism's genes. In
other words, a disease state can be any illness or disorder, be it of genetic or
environmental origin, for example, an inherited disorder such as certain breast
cancers, or a disorder which is characterized by expression of gene(s) normally in an
inactive, 'turned off' state in a healthy animal, or a disorder which is characterized by
under-expression or no expression of gene(s) which are normally activated or 'turned
on' in a normal healthy animal. Such differential expression of genes may also be
detected in a condition caused by infection, inflammation, or allergy, a condition
caused by development or aging of the animal, a condition caused by administration
of a drug or exposure of the animal to another agent, e.g., nutrition, which affects
gene expression. Essentially, the methods described herein can be adapted to detect
differential gene expression resulting from any cause, by manipulation of the defined
oligonucleotide/polynucleotides and the samples tested as described below. The
concept of disease or disease state also includes its temporal aspects in terms of
progression and treatment.

The phrase "differentially expressed" refers to those situations in
which a gene transcript is found in differing numbers of copies, or in activated vs
inactivated states, in different cell types or tissue types of an organism having a
selected disease as contrasted to the levels of the gene transcript found in the same
cells or tissues of a healthy organism. Genes may be differentially expressed in
differing states of activation in microorganisms or pathogens in different stages of
development. For example, multiple copies of gene transcripts may be found in an
organism having a selected disease, while only one, or significantly fewer copies, of
the same gene transcript are found in a healthy organism, or vice-versa.
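
As a rough illustration of this definition, the following Python sketch flags transcripts whose copy numbers differ between a healthy and a diseased sample. The function name, the dictionary inputs, the pseudocount and the two-fold threshold are all assumptions made for the example; the specification does not prescribe any particular threshold or data format.

```python
import math

def differentially_expressed(healthy, diseased, min_fold=2.0, pseudocount=1.0):
    """Return {transcript: log2 ratio} for transcripts whose copy numbers
    differ by at least `min_fold` between the two samples.

    `healthy` and `diseased` map a transcript identifier to an observed
    copy number (or signal). All names and defaults are illustrative.
    """
    threshold = math.log2(min_fold)
    flagged = {}
    for transcript in sorted(set(healthy) | set(diseased)):
        # The pseudocount avoids division by zero for absent transcripts.
        h = healthy.get(transcript, 0.0) + pseudocount
        d = diseased.get(transcript, 0.0) + pseudocount
        log_ratio = math.log2(d / h)
        if abs(log_ratio) >= threshold:
            flagged[transcript] = log_ratio
    return flagged
```

A positive ratio indicates more copies in the diseased sample, and a negative ratio indicates fewer, which covers the "vice-versa" case in the definition.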

As used herein, the term "solid support" refers to any known substrate
which is useful for the immobilization of large numbers of
oligonucleotide/polynucleotide sequences by any available method to enable
detectable hybridization of the immobilized oligonucleotide/polynucleotide sequences
with other polynucleotide sequences in a sample. Among a number of available solid
supports, one desirable example is the supports described in International Patent
Application No. WO91/07087, published May 30, 1991. Also useful are supports such
as, but not limited to, nitrocellulose, mylein, glass, silica and Pall Biodyne C®. It is
also anticipated that improvements yet to be made to conventional solid supports may
also be employed in this invention.

The term "surface" means any generally two-dimensional structure on 
a solid support to which the desired oligonucleotide/polynucleotide sequence is 
attached or immobilized. A surface may have steps, ridges, kinks, terraces and the
like.

As used herein, the term "predefined region" refers to a localized area 
on a surface of a solid support on which is immobilized one or multiple copies of a 
particular oligonucleotide/polynucleotide sequence and which enables the 
identification of the oligonucleotide/polynucleotide at the position, if hybridization of 
that oligonucleotide/polynucleotide to a sample polynucleotide occurs.
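
The role of a predefined region can be pictured as a lookup from position to sequence identity. The sketch below is only an illustration under assumed conventions: regions are laid out row by row on a rectangular grid, and the names `build_region_map` and `identify_hybridized` are hypothetical.

```python
def build_region_map(probe_ids, n_columns):
    """Assign each probe identifier to a (row, column) region, row by row."""
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    return {divmod(i, n_columns): pid for i, pid in enumerate(probe_ids)}

def identify_hybridized(region_map, signal_positions):
    """Translate positions showing hybridization back into probe identifiers.

    Positions that do not correspond to any region are ignored.
    """
    return [region_map[pos] for pos in signal_positions if pos in region_map]
```

Because the position alone identifies the immobilized sequence, a detected signal at a region is enough to name the oligonucleotide/polynucleotide that hybridized.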

The term "immobilized" refers to the attachment of the
oligonucleotide/polynucleotide to the solid support. Means of immobilization are 
known and conventional to those of skill in the art, and may depend on the type of 
support being used. 

By "EST" or "Expressed Sequence Tag" is meant a partial DNA or
cDNA sequence of about 150 to 500, more preferably about 300, sequential
nucleotides of a longer sequence obtained from a genomic or cDNA library prepared
from a selected cell, cell type, tissue or tissue type, organ or organism, which longer
sequence corresponds to an mRNA of a gene found in that library. An EST is
generally DNA. One or more libraries made from a single tissue type typically
provide at least about 3000 different (i.e., unique) ESTs and potentially the full
complement of all possible ESTs representing all cDNAs, e.g., 50,000-100,000 in an
animal such as a human. Further background and information on the construction of
ESTs is described in M. D. Adams et al, Science, 252:1651-1656 (1991); and
International Application Number PCT/US92/652[...] 7, 1993).

As used herein, the term "defined oligonucleotide/polynucleotide
sequence" refers to a known nucleotide sequence fragment of a selected EST or gene.
This term is used interchangeably with the term "fragments of EST". These
sequential sequences are generally comprised of between about 15 and about 45
nucleotides and more preferably between about 20 and about 25 nucleotides in length.
Thus any single EST of 300 nucleotides in length may provide about 280 different
defined oligonucleotide/polynucleotide sequences of 20 nucleotides in length (e.g.,
20-mers). The lengths of the defined oligonucleotide/polynucleotides may be readily
increased or decreased as desired or needed, depending on the limitations of the solid
support on which they may be immobilized or the requirements of the hybridization
conditions to be employed. The length is generally guided by the principle that it
should be of sufficient length to insure that it is on average only represented once in
the population to be examined. Generally, these defined
oligonucleotide/polynucleotides are RNA or DNA and are preferably derived from
the anti-sense strand of the EST sequence or from a corresponding mRNA sequence
to enable their hybridization with samples of RNA or DNA. Modified nucleotides
may be incorporated to increase stability and hybridization properties.
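
The arithmetic in this definition can be checked with a short Python sketch. A 300-nucleotide EST contains 300 - 20 + 1 = 281 overlapping 20-mers, consistent with the "about 280" stated above. The function names are hypothetical, and the uniqueness estimate assumes a random sequence with equiprobable bases, which is a simplification not made in the specification.

```python
def tile_probes(est, k=20):
    """Return every contiguous k-nucleotide fragment of `est`, 5' to 3'."""
    return [est[i:i + k] for i in range(len(est) - k + 1)]

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def antisense(seq):
    """Reverse complement of a DNA sequence (the anti-sense strand)."""
    return seq.translate(_COMPLEMENT)[::-1]

def expected_random_matches(k, population_nt):
    """Expected number of chance occurrences of one specific k-mer in
    `population_nt` nucleotides of random, equiprobable sequence."""
    return population_nt / 4 ** k
```

Under that simplified model, a 20-mer is expected to occur by chance far less than once in 10^8 nucleotides, whereas a 10-mer would be expected to occur many times, which illustrates why fragment length is chosen with the size of the examined population in mind.
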
By the term "plurality of defined oligonucleotide/polynucleotide
sequences" is meant the following. A surface of a solid support may immobilize a
large number of "defined oligonucleotide/polynucleotides". For example, depending
upon the nature of the surface, it can immobilize from about 300 to upwards of
60,000 defined 20-mer oligonucleotide/polynucleotides. It is anticipated that future
improvements to solid surfaces will permit considerably larger such pluralities to be
immobilized on a single surface. A "plurality" of sequences refers to the use on any
one solid support of multiple different defined oligonucleotide/polynucleotides from a
single EST from a selected library, as well as multiple different defined
oligonucleotide/polynucleotides from different ESTs from the same library or many
libraries from the same or different tissues, and may also include multiple identical
copies of defined oligonucleotide/polynucleotides. Ultimately a plurality has at least
one oligonucleotide/polynucleotide per expressed gene in the entire organism. For
example, from a library producing about 5,000-10,000 ESTs, a single support can
include at least about 1-20 defined oligonucleotide/polynucleotides representing every
EST in that library. The composition of defined oligonucleotide/polynucleotides
which make up a surface according to this invention may be selected or designed as
desired.
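
The relationship between surface capacity and library size can be sketched as a simple division. The function name and the even allocation of regions across ESTs are assumptions made for illustration; the specification leaves the composition of the surface to the designer.

```python
def probes_per_est(surface_capacity, n_ests, max_per_est=20):
    """Number of distinct probes each EST can receive if every EST in the
    library is to be represented on one surface (0 if they cannot all fit)."""
    if n_ests <= 0:
        raise ValueError("n_ests must be positive")
    return min(max_per_est, surface_capacity // n_ests)
```

For a surface holding 60,000 defined 20-mers, this gives 12 probes per EST for a 5,000-EST library and 6 per EST for a 10,000-EST library, both within the 1-20 range mentioned above.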

The term "sample" is employed in the description of this invention in
several important ways. As used herein, the term "sample" encompasses any cell or
tissue from an organism. Any desired cell or tissue type in any desired state may be
selected to form a sample. For example, the sample cell desired may be a human T
cell; the desired cell type for use in this invention may be a quiescent T cell or an
activated T cell.

By the phrase "analogous sample" or "analogous cell or tissue" is
meant that according to this invention when the ESTs which provide the defined
oligonucleotide/polynucleotides are produced from a cDNA library prepared from a
single tissue or cell type source sample, e.g., liver tissue of a human, then the samples
used to hybridize to those immobilized defined oligonucleotide/polynucleotides are
preferably provided by the same type of sample from either a healthy or diseased
animal, i.e., liver tissue of a healthy human and liver tissue of a diseased or infected
human or from a human suspected of having that disease or infection. Alternatively,
if the surface contains defined oligonucleotide/polynucleotides from multiple cells or
tissues, then the "samples" which are hybridized thereto can be but are not limited to
samples obtained from analogous multiple tissues or cells.

The term "detectably hybridizing" means that the sample from the
healthy organism or diseased or infected organism is contacted with the defined
oligonucleotide/polynucleotides on the surface for sufficient time to permit the
formation of patterns of hybridization on the surfaces caused by hybridization
between certain polynucleotide sequences in the samples and certain immobilized
defined oligonucleotide/polynucleotides. These patterns are made detectable by the
use of available conventional techniques, such as fluorescent labelling of the samples.
Preferably hybridization takes place under stringent conditions, e.g., revealing
homologies of about 95%. However, if desired, other less stringent conditions may
be selected. Techniques and conditions for hybridization at selected stringencies are
well known in the art [see, e.g., Sambrook et al, Molecular Cloning: A Laboratory
Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY (1989)].
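
To make the "about 95%" figure concrete, the sketch below computes percent identity between a probe and an already aligned target region of the same length, both written on the same strand. This is only a computational proxy with hypothetical function names: in practice stringency is set by hybridization conditions such as temperature and salt concentration, not by a calculation.

```python
def percent_identity(probe, target):
    """Percentage of matching positions between two equal-length,
    already aligned sequences (case-insensitive, no gaps)."""
    if len(probe) != len(target) or not probe:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(probe.upper(), target.upper()))
    return 100.0 * matches / len(probe)

def passes_stringency(probe, target, min_identity=95.0):
    """True if the pair meets the assumed identity cutoff."""
    return percent_identity(probe, target) >= min_identity
```

For a 20-mer, a single mismatch corresponds to 95% identity, so under this assumed cutoff a 20-mer probe tolerates at most one mismatched position.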

II. Compositions of The Invention

The present invention is based upon the use of ESTs from any desired 
cell or tissue in known technologies for oligonucleotide/polynucleotide hybridization. 
A. ESTs

An EST, as defined above, is, for an animal, a sequence from a
cDNA clone that corresponds to an mRNA. The EST sequences useful in the present
invention are isolated preferably from cDNA libraries using a rapid screening and
sequencing technique. Custom made cDNA libraries are made using known
techniques. See, generally, Sambrook et al, cited above. Briefly, mRNA from a
selected cell or tissue is reverse transcribed into complementary DNA (cDNA) using
the reverse transcriptase enzyme and made double-stranded using RNase H coupled
with DNA polymerase or reverse transcriptase. Restriction enzyme sites are added to
the cDNA and it is cloned into a vector. The result is a cDNA library. Alternatively,
commercially available cDNA libraries may be used. Libraries of cDNA can also be
generated from recombinant expression of genomic DNA using known techniques,
including polymerase chain reaction-derived techniques.

ESTs (which can range from about 150 to about 500 nucleotides in
length, preferably about 300 nucleotides) can be obtained through sequence analysis
from either end of the cDNA insert. Desirably, the DNA libraries used to obtain
ESTs use directional cloning methods so that either the 5' end of the cDNA (likely to
contain coding sequence) or the 3' end (likely to be a non-coding sequence) can be
selectively obtained.

In general, the method for obtaining ESTs comprises applying
conventional automated DNA sequencing technology to screen clones,
advantageously randomly selected clones, from a cDNA library. The cDNA libraries
from the desired tissue can be preprocessed, or edited, by conventional techniques to
reduce repeated sequencing of high and intermediate abundance clones and to
maximize the chances of finding rare messages from specific cell populations.
Preferably, preprocessing includes the use of defined composition prescreening
probes, e.g., cDNA corresponding to mitochondria, abundant sequences, ribosomes,
actins, myelin basic polypeptides, or any other known high abundance peptide. These
prescreening probes used for preprocessing are generally derived from known ESTs.
Other useful preprocessing techniques include subtraction hybridization, which
preferentially reduces the population of highly represented sequences in the library
[e.g., see Fargnoli et al, Anal. Biochem., 182:364 (1990)] and normalization, which
results in all sequences being represented in approximately equal proportions in the
library [Patanjali et al, Proc. Natl. Acad. Sci. USA, 88:1943 (1991)]. Additional
prescreening/differential screening approaches are known to those skilled in the art.

ESTs can then be generated from partial DNA sequencing of the 
selected clones. The ESTs useful in the present invention are preferably generated 
using low redundancy of sequencing, typically a single sequencing reaction. While 
single sequencing reactions may have an accuracy as low as 90%, this nevertheless
provides sufficient fidelity for identification of the sequence and design of PCR 
primers. 

If desired, the location of an EST in a full length cDNA is determined 
5 by analyzing the EST for the presence of coding sequence. A conventional computer 
program is used to predict the extent and orientation of the coding region of a 
sequence (using all six reading frames). Based on this information, it is possible to 
infer the presence of start or stop codons and whether the sequence 
is completely coding or completely non-coding or a combination of the two. If start 
10 or stop codons are present, then the EST can cover both part of the 5'-untranslated or 
3'-untranslated part of the mRNA (respectively) as well as part of the coding 
sequence. If no coding sequence is present, it is likely that the EST is derived from 
the 3' untranslated sequence due to its longer length and the fact that most cDNA 
library construction methods are biased toward the 3' end of the mRNA. It should be 
15 understood that both coding and non-coding regions may provide ESTs equally useful 

in the described invention. 
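
The six-frame analysis described above can be illustrated with a minimal Python sketch. It is not the "conventional computer program" referred to in the text. The function names, and the use of the longest stop-free run of codons in each frame as a rough proxy for coding potential, are assumptions made here for illustration only.

```python
from typing import Dict, Tuple

# Stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_stop_free_run(strand_seq: str, frame: int) -> int:
    """Length (in nucleotides) of the longest run of complete codons
    containing no stop codon, reading strand_seq in the given frame (0-2)."""
    best = 0
    run_start = frame  # start of the current stop-free run
    end = frame        # position just past the last complete codon read
    for i in range(frame, len(strand_seq) - 2, 3):
        if strand_seq[i:i + 3] in STOP_CODONS:
            best = max(best, i - run_start)
            run_start = i + 3
        end = i + 3
    return max(best, end - run_start)


def six_frame_scan(est: str) -> Dict[Tuple[str, int], int]:
    """Map (strand, frame) to the longest stop-free run for all six frames.

    Hypothetical helper: a long run in one frame hints at coding sequence,
    while short runs in every frame hint at untranslated sequence.
    """
    forward = est.upper()
    reverse = reverse_complement(est)
    return {
        (strand, frame): longest_stop_free_run(seq, frame)
        for strand, seq in (("+", forward), ("-", reverse))
        for frame in range(3)
    }
```

Calling `six_frame_scan` on an EST returns one value per frame and strand. How long a run must be before the EST is treated as coding is left open here, since the text does not specify a cutoff.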

A number of specific ESTs suitable for use in the present 
invention are described in Adams et al. (supra), which may be incorporated by 
reference herein, to describe non-essential examples of desirable ESTs. Other ESTs 

20 exist in the art which may also be useful in this invention, as will ESTs yet to be 
developed by these known techniques. 

B. Preparing the Solid Support of the Invention 

Oligonucleotide sequences which are fragments of defined 
sequence are derived from each EST by conventional means, e.g., conventional 

25 chemical synthesis or recombinant techniques. Each defined 

oligonucleotide/polynucleotide sequence as described above is a fragment, which can be, but 
is not necessarily, an anti-sense fragment, of an EST isolated from a DNA library 
prepared from a selected cell or tissue type from a selected animal. For use in the 
present invention, it is presently preferred that the defined 

30 oligonucleotide/polynucleotide sequences are 20-25mers. As described above, for 
each EST a number of such 20-25mers may be generated. The lengths may vary as 
described above as well as the composition. For example, 
oligonucleotide/polynucleotides can be modified based on the Oligo 4.0 or similar 
programs to predict hybridization potential or to include modified nucleotides for the 

35 reasons given above. It is also appreciated that large DNA segments may be 
employed, including entire ESTs or even full length genes, particularly when inserted 
into cloning vectors. 

A plurality of these defined oligonucleotide/polynucleotide 
sequences are then attached to a selected solid support conventionally used for the 
attachment of nucleotide sequences again by known means. In contrast to other 
technologies available in the art, this support is designed to contain defined, not 
5 random, oligonucleotide/polynucleotide sequences. The EST fragments, or defined 
oligonucleotide/polynucleotide sequences, immobilized on the solid support can 
include fragments of one or more ESTs from a library of at least one selected tissue 
or cell sample of a healthy animal, at least one analogous sample of the animal having 
a disease, at least one analogous sample of the animal infected with a pathogen, and 
10 any combination thereof. 

Numerous conventional methods are employed for attaching 
biological molecules such as oligonucleotide/polynucleotide sequences to surfaces of 
a variety of solid supports. See, e.g., Affinity Techniques, Enzyme Purification: Part 
B. Methods in Enzymology, Vol. 34, ed. W.B. Jakoby, M. Wilcheck, Acad. Press, 
15 NY (1974); Immobilized Biochemicals and Affinity Chromatography, Advances in 
Experimental Medicine and Biology, vol. 42, ed. R. Dunlap, Plenum Press, NY 
(1974); U. S. Patent No. 4,762,881; U. S. Patent No. 4,542,102; European Patent 
Publication No. 391,608 (October 10, 1990); U. S. Patent No. 4,992,127 (Nov. 21, 
1989). 

20 One desirable method for attaching 

oligonucleotide/polynucleotide sequences derived from ESTs to a solid support is 
described in International Application No. PCT/US90/06607 (published May 30, 
1991). Briefly, this method involves forming predefined regions on a surface of a 
solid support, where the predefined regions are capable of immobilizing ESTs. The 

25 methods make use of binding substances attached to the surface which enable 
selective activation of the predefined regions. Upon activation, these binding 
substances become capable of binding and immobilizing 
oligonucleotide/polynucleotides based on EST or longer gene sequences. 

Any of the known solid substrates suitable for binding 

30 oligonucleotide/polynucleotides at pre-defined regions on the surface thereof for 
hybridization and methods for attaching the oligonucleotide/polynucleotides thereto 
may be employed by one of skill in the art according to this invention. Similarly, 
known conventional methods for making hybridization of the immobilized 
oligonucleotide/polynucleotides detectable, e.g., fluorescence, radioactivity, 

35 photoactivation, biotinylation, solid state circuitry, and the like may be used in this 
invention. 

Thus, by resorting to known techniques, the invention provides 
a composition suitable for use in hybridization which consists of a surface of a solid 


support on which is immobilized at pre-defined regions on said surface a plurality of 
defined oligonucleotide/polynucleotide sequences for hybridization. For example, 
one composition of this invention is a solid support on which are immobilized oligos 
of EST fragments from a library constructed from a single cell type, e.g., a human 
5 stem cell, or a single tissue, e.g., human liver, from a healthy human. Still another 
composition of this invention is another solid support on which are immobilized 
oligos of EST fragments from a library constructed from a single cell type or a tissue 
from a human having a selected disease or predisposition to a selected disease, e.g., 
liver cancer. 

10 Another embodiment of the compositions of this invention 

includes a single solid support having oligonucleotides of ESTs from both single cell 
or single tissue libraries from both a healthy and diseased human. Still other 
embodiments include a single support on which are immobilized oligos of EST 
fragments from more than one tissue or cell library from a healthy human or a single 
15 support on which are immobilized more than one tissue or cell library from both 
healthy and diseased animals or humans. A preferred composition of this invention is 
anticipated to be a single support containing oligos of ESTs for all known cells and 
tissues from a selected organism. 

20 III. The Methods of the Invention 

A. Identification of Genes 

The present invention employs the compositions described 
above in methods for identifying genes which are differentially expressed in a normal 
healthy organism and an organism having a disease or infection. These methods may 
25 be employed to detect such genes, regardless of the state of knowledge about the 
function of the gene. The method of this invention by use of the compositions 
containing multiple defined EST fragments from a single gene as described above is 
able to detect levels of expression of genes or in other cases simply the expression or 
lack thereof, which differ between normal, healthy organisms and organisms having a 
30 selected disease, disorder or infection. 

One such method employs a first surface of a solid support on 
which is immobilized at pre-defined regions thereon a plurality of defined 
oligonucleotide/polynucleotide sequences, described above, of EST or longer gene 
fragment isolated from a cDNA library prepared from at least one selected tissue or 
35 cell sample of a healthy animal (the "healthy test surface") and a second such surface 
on which is immobilized at pre-defined regions a plurality of defined 
oligonucleotide/polynucleotide sequences of EST or longer gene fragment isolated 
from at least one analogous tissue of an animal having a selected disease (the "disease 
test surface"). These test surfaces may be standardized for the selected animal or 
selected cell or tissue sample from that animal (i.e., they are prescreened for 
polymorphisms in the species population). 

Polynucleotide sequences are then isolated from mRNA and/or 
5 cDNA from a biological sample from a known healthy animal ("healthy control") and 
a second sample is similarly prepared from a sample from a known diseased animal 
("disease sample"). These two samples are desirably selected from the cell or tissue 
analogous to that which provided the immobilized oligonucleotide/polynucleotides. 

According to the method, the healthy control sample is 

10 contacted with one set of the healthy test surface and the disease test surface 
described above for a time sufficient to permit detectable hybridization to occur 
between the sample and the immobilized defined oligonucleotide/polynucleotides on 
each surface. The results of this hybridization are a first hybridization pattern formed 
between the nucleotides of healthy control and the healthy test surface and a second 

15 hybridization pattern formed between the nucleotides of healthy control sample and 
the disease test surface. 

In a similar manner, the disease sample is detectably hybridized 
to another set of healthy test and disease test surfaces, forming a third hybridization 
pattern between the disease sample and healthy test surface and a fourth hybridization 

20 pattern between the disease sample and the disease test surface. 

Comparing the four hybridization patterns permits detection of 
those defined oligonucleotide/polynucleotides which are differentially expressed 
between the healthy control and the disease sample by the presence of differences in 
the hybridization patterns at pre-defined regions. The 

25 oligonucleotide/polynucleotides on each surface which correspond to the pattern 
differences may be readily identified with the corresponding EST or longer gene 
fragment from which the oligonucleotide/polynucleotides are obtained. 
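
The four-pattern comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a hybridization pattern is modeled as a mapping from a pre-defined region identifier to a measured signal, and a region is reported when its signal differs by at least a chosen fold change. The names `differing_regions` and `compare_four_patterns`, the default two-fold threshold, and the signal floor are hypothetical; the text does not specify how a "difference" is scored.

```python
from typing import Dict, List

# Assumed representation: pre-defined region id -> hybridization signal.
Pattern = Dict[str, float]


def differing_regions(control: Pattern, disease: Pattern,
                      min_fold: float = 2.0, floor: float = 1.0) -> List[str]:
    """Return regions, present in both patterns, whose signals differ
    by at least min_fold. Signals below floor are raised to floor so
    that a ratio can always be formed (an assumption of this sketch)."""
    hits = []
    for region in sorted(control.keys() & disease.keys()):
        c = max(control[region], floor)
        d = max(disease[region], floor)
        if max(c, d) / min(c, d) >= min_fold:
            hits.append(region)
    return hits


def compare_four_patterns(first: Pattern, second: Pattern,
                          third: Pattern, fourth: Pattern,
                          min_fold: float = 2.0) -> Dict[str, List[str]]:
    """Compare like with like: the first and third patterns were both
    formed on the healthy test surface, the second and fourth on the
    disease test surface."""
    return {
        "healthy_test_surface": differing_regions(first, third, min_fold),
        "disease_test_surface": differing_regions(second, fourth, min_fold),
    }
```

Each region returned would then be traced back to the EST or longer gene fragment immobilized at that position on the corresponding surface.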

In another embodiment of the method of this invention, the 
same process is employed, with the exception that the plurality of defined 

30 oligonucleotide/polynucleotide sequences forming the healthy test sample and the 
disease test sample surfaces are immobilized on a single solid support. For example, 
each fragment of an EST or longer gene fragment on the surface is isolated from at 
least two cDNA libraries prepared from a selected cell or tissue sample of a healthy 
animal and an analogous selected cell or tissue sample of an animal having a disease. 

35 According to this embodiment, the healthy control sample is 

detectably hybridized to a copy of this single solid surface, forming one hybridization 
pattern with oligonucleotide/polynucleotides associated with both the healthy and 
diseased animal. Similarly, the disease sample is detectably hybridized to a second 
copy of this single solid surface, forming one hybridization pattern with 
oligonucleotide/polynucleotides associated with both the healthy and diseased animal. 

Comparing the two hybridization patterns permits detection of 
those defined oligonucleotide/polynucleotides which are differentially expressed 

5 between the healthy control and the disease sample by the presence of differences in 
the hybridization patterns at pre-defined regions. The 
oligonucleotide/polynucleotides on each surface which correspond to the pattern 
differences may be readily identified with the corresponding EST or longer gene 
fragment from which the oligonucleotide/polynucleotides are obtained. 

10 The identification of one or more ESTs as the source of the 

defined oligonucleotide/polynucleotide which produced a "difference" in 
hybridization patterns according to these methods permits ready identification of the 
gene from which those ESTs were derived. Because the oligonucleotides are of sufficient 
length that they will hybridize under stringent conditions only with an RNA/cDNA for 

15 that gene to which they correspond, the oligo can be used to identify the EST and in 
turn the clone from which it was derived and, by subsequent cloning, to obtain the 
sequence of the full-length cDNA and its genomic counterparts, i.e., the gene, from 
which it was obtained. 

In other words, the ESTs identified by the method of this 

20 invention can be employed to determine the complete sequence of the mRNA, in the 
form of transcribed cDNA, by using the EST as a probe to identify a cDNA clone 
corresponding to a full-length transcript, followed by sequencing of that clone. The 
EST or the full length cDNA clone can also be used as a probe to identify a genomic 
clone or clones that contain the complete gene including regulatory and promoter 

25 regions, exons, and introns. 

It should be appreciated that one does not have to be restricted 
to using ESTs from the particular tissue from which probe RNA or cDNA is obtained; 
rather, any or all ESTs (known or unknown) may be placed on the support. 
Hybridization will be used to form diagnostic patterns or to identify which particular 

30 EST is detected. For example, all known ESTs from an organism are used to produce 
a "master" solid support to which control sample and disease samples are alternately 
hybridized. One then detects a pattern of hybridization associated with the particular 
disease state which then forms the basis of a diagnostic test or the isolation of 
disease specific ESTs from which the intact gene may be cloned and sequenced, 

35 leading ultimately to a defined therapeutic target. 

Methods for obtaining complete gene sequences from ESTs are 
well-known to those of skill in the art. See, generally, Sambrook et al, cited above. 
Briefly, one suitable method involves purifying the DNA from the clone that was 
sequenced to give the EST and labeling the isolated insert DNA. Suitable labeling 
systems are well known to those of skill in the art [see, e.g., Basic Methods in 
Molecular Biology, L. G. Davis et al, ed., Elsevier Press, NY (1986)]. The labeled 
EST insert is then used as a probe to screen a lambda phage cDNA library or a 
5 plasmid cDNA library, identifying colonies containing clones related to the probe 
cDNA which can be purified by known methods. The ends of the newly purified 
clones are then sequenced to identify full length sequences and complete sequencing 
of full length clones is performed by enzymatic digestion or primer walking. A 
similar screening and clone selection approach can be applied to clones from a 
10 genomic DNA library. 

Additionally, an EST or gene identified by this method as 
associated with inherited disorders can be used to determine at what stage during 
embryonic development the selected gene from which it is derived is developed by 
screening embryonic DNA libraries from various stages of development, e.g. 2-cell, 
15 8-cell, etc., for the selected gene. As has been mentioned above, the invention may 
be applied in additional temporal modes for monitoring the progression of a disease 
state, the efficacy of a particular treatment modality or the aging process of an 
individual. 

Thus, the methods of this invention permit the identification, 
20 isolation and sequencing of a gene which is differentially expressed in a selected 
disease/infection. As described in more detail below, the identified gene may then be 
employed to obtain any protein encoded thereby, or may be employed as a target for 
diagnostic methods or therapeutic approaches to the treatment of the disease, 
including, e.g., drug development. 
25 The same methods as described above for the identification of 

genes, including genes of unknown function, which are differentially expressed in a 
disease state, may also be employed to identify other genes of interest. For example, 
another embodiment of this invention includes a method for identifying a gene of a 
pathogen which is expressed in a biological sample of an animal infected with that 
30 pathogen or the gene of the host which is altered in its expression as a result of the 
infection. 

One such method employs a healthy test surface as described 
above, employing defined oligonucleotide/polynucleotides from a sample of a 
healthy, uninfected animal. The second such surface has immobilized at pre-defined 
35 regions thereon a plurality of defined oligonucleotide/polynucleotide sequences of 
ESTs isolated from at least one analogous tissue or cell sample of an infected animal 
(the "infection test surface"). Polynucleotide sequences are isolated from a biological 
sample from a healthy animal ("healthy control") and a second sample is similarly 
prepared from an animal infected with the selected pathogen ("infection sample"). 

These two samples are desirably selected from the cell or tissue analogous to that 

which provided the immobilized oligonucleotide/polynucleotides. It would also be 

possible to provide samples from the nucleic acid of the pathogen itself. 
5 According to the method the healthy control sample is 

contacted with one set of the healthy test surface and the infection test surface 
described above for a time sufficient to permit detectable hybridization to occur 

between the sample and the immobilized defined oligonucleotide/polynucleotides on 

each surface. The results of this hybridization are a first hybridization pattern formed 
10 between the nucleotides of healthy control and the healthy test surface and a second 

hybridization pattern formed between the nucleotides of healthy control sample and 

the infection test surface. 

In a similar manner, the infection sample is detectably 

hybridized to another set of healthy test and infection test surfaces, forming a third 
15 hybridization pattern between the infection sample and healthy test surface and a 

fourth hybridization pattern between the infection sample and the infection test 

surface. 

Comparing the four hybridization patterns permits detection of 
those defined oligonucleotide/polynucleotides which are differentially expressed 

20 between the healthy animal and the animal infected with the pathogen by the presence 
of differences in the hybridization patterns at pre-defined regions. As mentioned, 
differential expression is not required and simple qualitative analysis is possible by 
reference to gene expression which is simply present or absent. 

A second embodiment of this method parallels the second 

25 embodiment of the method as applied to disease above, i.e., the same process is 
employed, with the exception that the plurality of defined oligonucleotide/polynucleotide 
sequences forming the healthy test sample surface and the infection test sample 
surface are immobilized on a single solid support. The resulting first hybridization 
pattern (healthy control sample with healthy/infection test sample) and second 

30 hybridization pattern (infection sample with healthy/infection test sample) permits 
detection of those defined oligonucleotide/polynucleotides which are differentially 
expressed between the healthy control and the infection sample by the presence of 
differences in the hybridization patterns at pre-defined regions. The 
oligonucleotide/polynucleotides on each surface which correspond to the pattern 

35 differences may be readily identified with the corresponding ESTs from which the 
oligonucleotide/polynucleotides are obtained. 

As described above for the methods for identifying differential 
gene expression between diseased and healthy animals, the 
oligonucleotide/polynucleotides on each surface which correspond to the pattern 
differences may be readily identified with the corresponding ESTs from which the 
oligonucleotide/polynucleotide sequences are obtained and the genes expressed by the 
pathogen identified for similar purposes. Other embodiments of these methods may 

5 be developed with resort to the teaching herein, by altering the samples which provide 
the defined oligonucleotide/polynucleotides. For example, an EST identified with a 
differentially expressed gene by the method of this invention is also useful in 
detecting genes expressed in the various stages of a pathogen's development, 
particularly the infective stage, and following the course of drug treatment and 

10 emergence of resistant variants. For example, employing the techniques described 
above, the EST can be used for detecting a gene in various stages of the parasitic 
Plasmodium species life cycle, which include blood stages, liver stages, and 
gametocyte stages. 

B. Diagnostic Methods 

15 In addition to use of the methods and compositions of this 

invention for identifying differentially expressed genes, another embodiment of this 
invention provides diagnostic methods for diagnosing a selected disease state, or a 
selected state resulting from aging, exposure to drugs or infection in an animal. 
According to this aspect of the invention, a first surface, described as the healthy test 

20 surface above, and a second surface, described as the disease test surface or infection 
test surface, are prepared depending on the disease or infection to be diagnosed. The 
same processes of detectable hybridization to a first and second set of these surfaces 
with the healthy control sample and disease/infection sample are followed to provide 
the four above-described hybridization patterns, i.e., healthy control sample with 

25 healthy test surface; healthy control sample with disease/infection test surface; 
disease/infection sample with healthy test surface; and disease/infection sample with 

disease/infection test surface. 

The diagnosis of disease or infection is provided by comparing 
the four hybridization patterns. Substantial differences between the first and third 
30 hybridization patterns, respectively, and the second and fourth hybridization patterns, 
respectively, indicate the presence of the selected disease or infection in said animal. 
Substantial similarities in the first and third hybridization patterns and second and 
fourth hybridization patterns indicate the absence of disease or infection. 
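
The diagnostic comparison above can be sketched in Python. This is a hedged illustration, not the method of the specification: "substantial similarity" is modeled here as a Pearson correlation at or above a threshold, each pattern is a list of signals ordered by pre-defined region, and the sample from the animal being tested supplies the third and fourth patterns. The function names, the 0.9 threshold, and the "indeterminate" outcome for mixed results are all assumptions introduced for this example.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long signal vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("patterns must cover the same regions (at least two)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant pattern has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def diagnose(first: Sequence[float], second: Sequence[float],
             third: Sequence[float], fourth: Sequence[float],
             similar_at: float = 0.9) -> str:
    """Compare first with third and second with fourth.

    Returns "absent" when both pairs are similar, "present" when both
    pairs differ, and "indeterminate" otherwise (a case the text does
    not address).
    """
    healthy_surface_similar = pearson(first, third) >= similar_at
    disease_surface_similar = pearson(second, fourth) >= similar_at
    if healthy_surface_similar and disease_surface_similar:
        return "absent"
    if not healthy_surface_similar and not disease_surface_similar:
        return "present"
    return "indeterminate"
```

Correlation is only one possible similarity measure; a count of regions that differ beyond a fold-change threshold would serve equally well in this sketch.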

A similar embodiment utilizes the single surface bearing both 
35 the healthy test surface defined oligonucleotide/polynucleotides and the 
disease/infection test surface defined oligonucleotide/polynucleotides as described 
above. Parallel process steps as described above for detection of genes differentially 
expressed in disease and infected states are followed, resulting in a first hybridization 
pattern (healthy control sample with single healthy and disease/infection test sample) 
and a second hybridization pattern (disease/infection sample with another copy of the 
single healthy and disease/infection test sample). 

Diagnosis is accomplished by comparing the two hybridization 
5 patterns, wherein substantial differences between the first and second hybridization 
patterns indicate the presence of the selected disease or infection in the animal being 

tested. Substantially similar first and second hybridization patterns indicate the 
absence of disease or infection. This, like many of the foregoing embodiments, may 
use known or unknown ESTs derived from many libraries. 

10 C. Other Methods of the Invention 

As is obvious to one of skill in the art upon reading this 
disclosure, the compositions and methods of this invention may also be used for other 
similar purposes. For example, the general methods and compositions may be 
adapted easily by manipulation of the samples selected to provide the standardized 

15 defined oligonucleotide/polynucleotides, and selection of the samples selected for 
hybridization thereto. One such modification is the use of this invention to identify 
cell markers of any type, e.g., markers of cancer cells, stem cell markers, and the like. 
Another modification involves the use of the method and compositions to generate 
hybridization patterns useful for forensic identification or an 'expression fingerprint' 

20 of genes for identification of one member of a species from another. Similarly, the 
methods of this invention may be adapted for use in tissue matching for 
transplantation purposes as well as for molecular histology, i.e., to enable diagnosis of 
disease or disorders in pathology tissue samples such as biopsies. Still another use of 
this method is in monitoring the effects of development and aging upon the gene 

25 expression in a selected animal, by preparing surfaces bearing 
oligonucleotide/polynucleotides prepared from samples of standardized younger 
members of the species being tested. Additionally the patient can serve as an internal 
control by virtue of having the method applied to blood samples every 5-10 years 
during his lifetime. 

30 Still another intriguing use of this method is in the area of 

monitoring the effects of drugs on gene expression, both in laboratories and during 
clinical trials with animals, especially humans. Because the method can be readily 
adapted by altering the above parameters, it can essentially be employed to identify 
differentially expressed genes of any organism, at any stage of development, and 

35 under the influence of any factor which can affect gene expression. 

IV. The Genes and Proteins Identified 

Application of the compositions and methods of this invention as 
above described also provides other compositions, such as any isolated gene sequence 
which is differentially expressed between a normal healthy animal and an animal 
5 having a disease or infection. Another embodiment of this invention is any isolated 
pathogen gene sequence which is expressed in tissue or cell samples of an infected 
animal. Similarly, an embodiment of this invention is any gene sequence identified by 
the methods described herein. 

These gene sequences may be employed in conventional methods to 

10 produce isolated proteins encoded thereby. To produce a protein of this invention, 
the DNA sequences of a desired gene identified by the use of the methods of this 
invention or portions thereof are inserted into a suitable expression system. 
Desirably, a recombinant molecule or vector is constructed in which the 
polynucleotide sequence encoding the protein is operably linked to a heterologous 

15 expression control sequence permitting expression of the human protein. Numerous 
types of appropriate expression vectors and host cell systems are known in the art for 
mammalian (including human) expression, insect, e.g., baculovirus expression, yeast, 
fungal, and bacterial expression, by standard molecular biology techniques. 

The transfection of these vectors into appropriate host cells, whether 

20 mammalian, bacterial, fungal, or insect, or into appropriate viruses, can result in 
expression of the selected proteins. Suitable host cells or cell lines for transfection, 
and viruses, as well as methods for the construction and transfection of such host cells 
and viruses are well-known. Suitable methods for transfection, culture, amplification, 
screening, and product production and purification are also known in the art. 

25 The genes and proteins identified by this invention can be employed, if 

desired, in diagnostic compositions useful for the diagnosis of a disease or infection 
using conventional diagnostic assays. For example, a diagnostic reagent can be 
developed which detectably targets a gene sequence or protein of this invention in a 
biological sample of an animal. Such a reagent may be a complementary nucleotide 

30 sequence, an antibody (monoclonal, recombinant or polyclonal), or a chemically 
derived agonist or antagonist. Alternatively, the proteins and polynucleotide 
sequences of this invention, fragments of same, or complementary sequences thereto, 
may themselves be useful as diagnostic reagents for diagnosing disease states with 
which the ESTs of the invention are associated. These reagents may optionally be 

35 labelled using diagnostic labels, such as radioactive labels, colorimetric enzyme label 
systems and the like conventionally used in diagnostic or therapeutic methods, e.g., 
Northern and Western blotting, antigen-antibody binding and the like. The selection 
of the appropriate assay format and label system is within the skill of the art and may 
readily be chosen without requiring additional explanation by resort to the wealth of 
art in the diagnostic area. 

Additionally, genes and proteins identified according to this invention 
may be used therapeutically. For example, the EST-containing gene sequences may 
5 be useful in gene therapy, to provide a gene sequence which in a disease is not 
properly or sufficiently expressed. In such a method, a selected gene sequence of this 
invention is introduced into a suitable vector or other delivery system for delivery to a 
cell containing a defect in the selected gene. Suitable delivery systems are well 
known to those of skill in the art and enable the desired EST or gene to be 

10 incorporated into the target cell and to be translated by the cell. The EST or gene 
sequence may be introduced to mutate the existing gene by recombination or provide 
an active copy thereof in addition to the inactive gene to replace its function. 

Alternatively, a protein encoded by an EST or gene of the invention 
may be useful as a therapeutic reagent for delivery of a biologically active protein, 

15 particularly when the disease state is associated with a deficiency of this protein. 
Such a protein may be incorporated into an appropriate therapeutic formulation, alone 
or in combination with other active ingredients. Methods of formulating such 
therapeutic compositions, as well as suitable pharmaceutical carriers, and the like, are 
well known to those of skill in the art. Still an additional method of delivering the 

20 missing protein encoded by an EST, or the gene from which a selected EST was 
derived, involves expressing it directly in vivo. Systems for such in vivo expression 
are well known in the art. 

Yet another use of the ESTs, genes identified according to the methods 
of this invention, or the proteins encoded thereby is as a target for the screening and 

25 development of natural or synthetic chemical compounds which have utility as 
therapeutic drugs for the treatment of disease states associated with the identified 
genes and ESTs derived therefrom. As one example, a compound capable of binding 
to such a protein encoded by such a gene and either preventing or enhancing its 
biological activity may be a useful drug component for the treatment or prevention of 

30 such disease states. 

Conventional assays and techniques may be used for the screening and 
development of such drugs. As one example, a method for identifying compounds 
which specifically bind to or inhibit or activate proteins encoded by these gene 
sequences can include simply the steps of contacting a selected protein or gene 

35 product, with a test compound to permit binding of the test compound to the protein; 
and determining the amount of test compound, if any, which is bound to the protein. 
Such a method may involve the incubation of the test compound and the protein 
immobilized on a solid support. Still other conventional methods of drug screening 
can involve employing a suitable computer program to determine compounds having 
similar or complementary chemical structures to that of the gene product or portions 
thereof and screening those compounds either for competitive binding to the protein 
or to detect enhanced or decreased activity in the presence of the selected compound. 

Thus, through use of such methods, the present invention is anticipated
to provide compounds capable of interacting with these genes, ESTs, or encoded
proteins, or fragments thereof, and either enhancing or decreasing the biological
activity, as desired. Such compounds are believed to be encompassed by this
invention.

Numerous modifications and variations of the present invention are
included in the above-identified specification and are expected to be obvious to one of
skill in the art. Such modifications and alterations to the compositions and processes
of the present invention are believed to be encompassed in the scope of the claims
appended hereto.

WHAT IS CLAIMED IS:

1. A method for identifying genes which are differentially expressed in
two different pre-determined states of an organism comprising:

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene, isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample in a first
state and present in excess relative to the polynucleotide to be hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene, isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample in a second
state and present in excess relative to the polynucleotide to be hybridized;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from said organism in said first
state, said sample selected from sources analogous to the sources of step (a), said
hybridization sufficient to form a first and second hybridization pattern on each said
first and second surface,

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from said organism in said second
state, said sample selected from sources analogous to the sources of step (c), said
hybridization sufficient to form a third and fourth hybridization pattern on each said
first and second surface,

e. comparing at least two of the four hybridization patterns,
wherein genes differentially expressed in said first and second states are identified by
the presence of differences in the hybridization patterns at pre-defined regions;

f. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs or larger
gene fragment from which the oligonucleotide/polynucleotides were obtained,
whereby identification of the EST or larger gene fragment permits identification of
the gene from which the ESTs or larger gene fragment were derived.

2. The method according to Claim 1 wherein said first and second states are
respectively healthy and disease; pathogen uninfected and pathogen infected; a first
progression state and a second progression of a disease or infection; a first treatment
state and a second treatment state of a disease or infection; or a first developmental
and a second developmental state.

3. The method according to Claim 1 wherein said organism is a plant or an
animal.

4. The method according to Claim 3 wherein said animal is a human.

5. A method for identifying genes which are differentially expressed in a 
normal healthy animal and an animal having a disease comprising: 

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a
fragment of an EST, an entire EST, a fragment of a gene or an entire gene, isolated
from a DNA library prepared from at least one selected cell, tissue, organ or organism
sample in a healthy animal and present in excess relative to the polynucleotide to be
hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions of said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a
fragment of an EST, an entire EST, a fragment of a gene or an entire gene, isolated
from a DNA library prepared from at least one selected cell, tissue, organ or organism
sample from an animal having said disease and present in excess relative to the
polynucleotide to be hybridized;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from a healthy animal, said sample
selected from sources analogous to the sources of step (a), said hybridization
sufficient to form a first and second hybridization pattern on each said first and
second surface, said sample selected from a cell or tissue sample analogous to the
sample of step (a), said hybridization sufficient to form a first and second
hybridization pattern on each said first and second surface;

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from an animal having said disease,
said sample selected from a cell or tissue sample analogous to the sample of step (c),
said hybridization sufficient to form a third and fourth hybridization pattern on each
said first and second surface,

e. comparing at least two of the four hybridization patterns, 
wherein genes differentially expressed in said first and second states are identified by 
the presence of differences in the hybridization patterns at pre-defined regions; 

f. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs or larger
gene fragment from which the oligonucleotide/polynucleotides were obtained,
whereby identification of the EST or larger gene fragment permits identification of
the gene from which the ESTs or larger gene fragment were derived.

6. A method for identifying genes which are differentially expressed in a
normal healthy animal and an animal having a disease comprising:

a. providing a surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene isolated from a DNA library
prepared from the group selected from at least one selected cell, tissue, organ or
organism sample of a healthy animal and an analogous selected sample of an
animal having said disease and both present in excess relative to the polynucleotide to
be hybridized;

b. detectably hybridizing to a first copy of said surface
polynucleotide sequences isolated from a healthy animal, said sample selected from a
cell or tissue sample analogous to the sample of step (a), said hybridization sufficient
to form a first hybridization pattern on said surface;

c. detectably hybridizing to a second copy of said surface
polynucleotide sequences isolated from an animal having said disease, said sample
selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a second hybridization pattern on said surface;

d. comparing the two hybridization patterns, wherein genes
differentially expressed in a disease state are identified by the presence of differences
in the hybridization patterns at pre-defined regions;

e. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs from which
the oligonucleotide/polynucleotides are obtained, whereby identification of the EST
permits identification of the gene from which the ESTs were derived.

7. A method for identifying a gene of a pathogen which is expressed in a 
biological sample of an animal infected with said pathogen comprising: 

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample of a
healthy, uninfected animal and present in excess relative to the polynucleotide to be
hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions of said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene isolated from at least one
selected cell, tissue, organ or organism sample of an infected animal;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from a healthy animal, said sample
selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form first and second hybridization patterns on each said
first and second surface,

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from an infected animal, said
sample selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form third and fourth hybridization patterns on each said
first and second surface,

e. comparing the four hybridization patterns, wherein genes of
said pathogen which are expressed in an infected animal are identified by the
presence of differences in the hybridization patterns at pre-defined regions;

f. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs from which
the oligonucleotide/polynucleotides are obtained, whereby identification of the EST
permits identification of the gene from which the ESTs were derived.

8. A method for identifying a gene of a pathogen which is expressed in a
biological sample of an animal infected with said pathogen comprising:

a. providing a surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene isolated from a DNA library
prepared from the group selected from at least one selected cell, tissue, organ or
organism sample of a healthy animal and an analogous selected sample of an
animal having said disease and both present in excess relative to the polynucleotide to
be hybridized;

b. detectably hybridizing to a first copy of said surface 
polynucleotide sequences isolated from a sample from a healthy animal, said sample 
selected from a cell or tissue sample analogous to the sample of step (a), said 
hybridization sufficient to form a first hybridization pattern on said surface; 

c. detectably hybridizing to a second copy of said surface
polynucleotide sequences isolated from a sample from an infected animal, said
sample selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a second hybridization pattern on said surface;

d. comparing the two hybridization patterns, wherein genes of
said pathogen which are expressed in an infected animal are identified by the
presence of differences in the hybridization patterns at pre-defined regions;

e. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs from which
the oligonucleotide/polynucleotides are obtained, whereby identification of the EST
permits identification of the gene from which the ESTs were derived.

9. A composition suitable for use in hybridization comprising a solid
surface on which is immobilized at pre-defined regions on said surface a plurality of
defined oligonucleotide/polynucleotide sequences for hybridization, each sequence
selected from the group consisting of a fragment of an EST, an entire EST, a fragment
of a gene or an entire gene isolated from a DNA library prepared from the group
selected from at least one selected cell, tissue, organ or organism sample of a healthy
animal, at least one analogous sample of said animal having a disease, at least one
analogous sample of said animal infected with a microbial pathogen, and any
combination thereof.

10. An isolated gene sequence which is differentially expressed in a
normal healthy animal and an animal having a disease, identified by the method of
claim 1.

11. An isolated pathogen gene sequence which is expressed in tissue or 
cell samples of an infected animal identified by the method of claim 7. 

12. A diagnostic composition useful for the diagnosis of a disease 
comprising a reagent capable of detectably targeting a gene sequence of claim 10 in a 
biological sample of an animal. 

13. A diagnostic composition useful for the diagnosis of infection by a
pathogen comprising a reagent capable of detectably targeting a gene sequence of
claim 11 in a biological sample of an animal.

14. An isolated protein produced by expression of a gene sequence of 
claim 10. 

15. An isolated pathogen protein produced by expression of a gene 
sequence of claim 11. 

16. A therapeutic composition comprising a protein or fragment thereof
selected from the group consisting of a protein of claim 10 and a protein of claim 15.

17. A method for diagnosing a selected disease or infection in an animal 
comprising: 

a. providing a first surface on which is immobilized at pre- 
defined regions on said surface a plurality of defined oligonucleotide/polynucleotide 
sequences, each sequence selected from the group consisting of a fragment of an EST, 
an entire EST, a fragment of a gene or an entire gene, isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample of a healthy 
animal and present in excess relative to the polynucleotide to be hybridized; 

b. providing a second surface on which is immobilized at pre- 
defined regions of said surface a plurality of defined oligonucleotide/polynucleotide 
sequences, each sequence comprising a fragment of an EST isolated from at least one 
said tissue of an animal having said disease; 

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a DNA library prepared from a sample from a
healthy animal, said sample selected from a cell or tissue sample analogous to the
sample of step (a), said hybridization sufficient to form a first and second
hybridization pattern on each said first and second surface;

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a DNA library prepared from a sample from
an animal having said disease, said sample selected from a cell or tissue sample
analogous to the sample of step (c), said hybridization sufficient to form a third and
fourth hybridization pattern on each said first and second surface;

e. comparing the four hybridization patterns, wherein substantial
differences between the first and third hybridization patterns and the second and
fourth hybridization patterns indicates the presence of said selected disease or
infection in said animal, and substantial similarities in said first and third
hybridization patterns and second and fourth hybridization patterns indicates the
absence of disease or infection.

18. A method for diagnosing a selected disease or infection in an animal 
comprising: 

a. providing a surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence comprising a fragment of an EST isolated from a DNA
library prepared from the group consisting of a selected cell or tissue sample of a
healthy animal and an analogous selected cell or tissue sample of an animal having
said disease;

b. detectably hybridizing to a first copy of said surface 
polynucleotide sequences isolated from a sample from a healthy animal, said sample 
selected from a cell or tissue sample analogous to the sample of step (a), said 
hybridization sufficient to form a first hybridization pattern on said surface; 

c. detectably hybridizing to a second copy of said surface
polynucleotide sequences isolated from a DNA library prepared from a sample from
an animal having said disease, said sample selected from a cell or tissue sample 
analogous to the sample of step (a), said hybridization sufficient to form a second 
hybridization pattern on said surface; 

d. comparing the two hybridization patterns, wherein substantial
differences between the first and second hybridization patterns indicates the presence
of said selected disease or infection in said animal, and substantial similarities in said
first and second hybridization patterns indicates the absence of disease or infection.

COMPARATIVE GENE TRANSCRIPT ANALYSIS
1. FIELD OF INVENTION

The present invention is in the field of molecular
biology and computer science; more particularly, the
present invention describes methods of analyzing gene
transcripts and diagnosing the genetic expression of cells
and tissues.

2. BACKGROUND OF THE INVENTION 

Until very recently, the history of molecular biology
has been written one gene at a time. Scientists have
observed the cell's physical changes, isolated mixtures
from the cell or its milieu, purified proteins, sequenced
proteins and therefrom constructed probes to look for the
corresponding gene.

Recently, different nations have set up massive
projects to sequence the billions of bases in the human
genome. These projects typically begin with dividing the
genome into large portions of chromosomes and then
determining the sequences of these pieces, which are then
analyzed for identity with known proteins or portions
thereof, known as motifs. Unfortunately, the majority of
genomic DNA does not encode proteins and though it is
postulated to have some effect on the cell's ability to
make protein, its relevance to medical applications is not
understood at this time.

A third methodology involves sequencing only the
transcripts encoding the cellular machinery actively
involved in making protein, namely the mRNA. The advantage
is that the cell has already edited out all the non-coding
DNA, and it is relatively easy to identify the protein-coding
portion of the RNA. The utility of this approach
was not immediately obvious to genomic researchers. In
fact, when cDNA sequencing was initially proposed, the
method was roundly denounced by those committed to genomic
sequencing. For example, the head of the U.S. Human Genome
project discounted cDNA sequencing as not valuable and
refused to approve funding of projects.

In this disclosure, we teach methods for analyzing
DNA, including cDNA libraries. Based on our analyses and
research, we see each individual gene product as a "pixel"
of information, which relates to the expression of that,
and only that, gene. We teach herein methods whereby the
individual "pixels" of gene expression information can be
combined into a single gene transcript "image," in which
each of the individual genes can be visualized
simultaneously, allowing relationships between the gene
pixels to be easily visualized and understood.

We further teach a new method which we call electronic
subtraction. Electronic subtraction will enable the gene
researcher to turn a single image into a moving picture,
one which describes the temporality or dynamics of gene
expression, at the level of a cell or a whole tissue. It
is that sense of "motion" of cellular machinery on the
scale of a cell or organ which constitutes the new
invention herein. This constitutes a new view into the
process of living cell physiology and one which holds great
promise to unveil and discover new therapeutic and
diagnostic approaches in medicine.

We teach another method which we call "electronic
northern," which tracks the expression of a single gene
across many types of cells and tissues.
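
As an illustration only, the "electronic northern" idea can be sketched in Python as a lookup of one gene's share of the clones in each of several libraries. This is a minimal sketch, not the inventors' implementation: the function name, the library names, the gene identifiers, and the clone counts are all hypothetical.

```python
from collections import Counter

def electronic_northern(libraries, gene):
    """Return {library name: fraction of that library's clones matching gene}."""
    profile = {}
    for name, counts in libraries.items():
        total = sum(counts.values())
        # An empty library contributes a fraction of zero rather than an error.
        profile[name] = counts.get(gene, 0) / total if total else 0.0
    return profile

# Hypothetical clone counts per identified transcript, one Counter per library.
libraries = {
    "liver": Counter({"albumin": 60, "actin": 30, "geneX": 10}),
    "brain": Counter({"actin": 45, "geneX": 5}),
    "muscle": Counter({"actin": 80, "myosin": 20}),
}

profile = electronic_northern(libraries, "geneX")
for tissue, fraction in sorted(profile.items()):
    print(f"{tissue}: {fraction:.1%}")
```

Normalizing by the library total is a design choice assumed here so that libraries of different sizes can be compared on the same scale.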

Nucleic acids (DNA and RNA) carry within their
sequence the hereditary information and are therefore the
prime molecules of life. Nucleic acids are found in all
living organisms including bacteria, fungi, viruses, plants
and animals. It is of interest to determine the relative
abundance of different discrete nucleic acids in different
cells, tissues and organisms over time under various
conditions, treatments and regimes.

All dividing cells in the human body contain the same
set of 23 pairs of chromosomes. It is estimated that these
autosomal and sex chromosomes encode approximately 100,000
genes. The differences among different types of cells are
believed to reflect the differential expression of the
100,000 or so genes. Fundamental questions of biology
could be answered by understanding which genes are
transcribed and knowing the relative abundance of
transcripts in different cells.

Previously, the art has only provided for the analysis
of a few known genes at a time by standard molecular
biology techniques such as PCR, northern blot analysis, or
other types of DNA probe analysis such as in situ
hybridization. Each of these methods allows one to analyze
the transcription of only known genes and/or small numbers
of genes at a time. Nucl. Acids Res. 19, 7097-7104 (1991);
Nucl. Acids Res. 18, 4833-42 (1990); Nucl. Acids Res. 18,
2789-92 (1989); European J. Neuroscience 2, 1063-1073
(1990); Analytical Biochem. 187, 364-73 (1990); Genet.
Annals Techn. Appl. 7, 64-70 (1990); GATA 8(4), 129-33
(1991); Proc. Natl. Acad. Sci. USA 85, 1696-1700 (1988);
Nucl. Acids Res. 19, 1954 (1991); Proc. Natl. Acad. Sci.
USA 88, 1943-47 (1991); Nucl. Acids Res. 19, 6123-27
(1991); Proc. Natl. Acad. Sci. USA 85, 5738-42 (1988);
Nucl. Acids Res. 16, 10937 (1988).

Studies of the number and types of genes whose
transcription is induced or otherwise regulated during cell
processes such as activation, differentiation, aging, viral
transformation, morphogenesis, and mitosis have been
pursued for many years, using a variety of methodologies.
One of the earliest methods was to isolate and analyze
levels of the proteins in a cell, tissue, organ system, or
even organisms both before and after the process of
interest. One method of analyzing multiple proteins in a
sample is using 2-dimensional gel electrophoresis, wherein
proteins can be, in principle, identified and quantified as
individual bands, and ultimately reduced to a discrete
signal. At present, 2-dimensional analysis only resolves
approximately 15% of the proteins. In order to positively
analyze those bands which are resolved, each band must be
excised from the membrane and subjected to protein sequence
analysis using Edman degradation. Unfortunately, most of
the bands were present in quantities too small to obtain a
reliable sequence, and many of those bands contained more
than one discrete protein. An additional difficulty is
that many of the proteins were blocked at the
amino-terminus, further complicating the sequencing
process.

Analyzing differentiation at the gene transcription
level has overcome many of these disadvantages and
drawbacks, since the power of recombinant DNA technology
allows amplification of signals containing very small
amounts of material. The most common method, called
"hybridization subtraction," involves isolation of mRNA
from the biological specimen before (B) and after (A) the
developmental process of interest, transcribing one set of
mRNA into cDNA, subtracting specimen B from specimen A
(mRNA from cDNA) by hybridization, and constructing a cDNA
library from the non-hybridizing mRNA fraction. Many
different groups have used this strategy successfully, and
a variety of procedures have been published and improved
upon using this same basic scheme. Nucl. Acids Res. 19,
7097-7104 (1991); Nucl. Acids Res. 18, 4833-42 (1990);
Nucl. Acids Res. 18, 2789-92 (1989); European J.
Neuroscience 2, 1063-1073 (1990); Analytical Biochem. 187,
364-73 (1990); Genet. Annals Techn. Appl. 7, 64-70 (1990);
GATA 8(4), 129-33 (1991); Proc. Natl. Acad. Sci. USA 85,
1696-1700 (1988); Nucl. Acids Res. 19, 1954 (1991); Proc.
Natl. Acad. Sci. USA 88, 1943-47 (1991); Nucl. Acids Res.
19, 6123-27 (1991); Proc. Natl. Acad. Sci. USA 85, 5738-42
(1988); Nucl. Acids Res. 16, 10937 (1988).

Although each of these techniques has particular
strengths and weaknesses, there are still some limitations
and undesirable aspects of these methods. First, the time
and effort required to construct such libraries is quite
large. Typically, a trained molecular biologist might
expect construction and characterization of such a library
to require 3 to 6 months, depending on the level of skill,
experience, and luck. Second, the resulting subtraction
libraries are typically inferior to the libraries
constructed by standard methodology. A typical
conventional cDNA library should have a clone complexity of
at least 10^6 clones, and an average insert size of 1-3 kb.
In contrast, subtracted libraries can have complexities of
10^2 or 10^3 and average insert sizes of 0.2 kb. Therefore,
there can be a significant loss of clone and sequence
information associated with such libraries. Third, this
approach allows the researcher to capture only the genes
induced in specimen A relative to specimen B, not
vice-versa, nor does it easily allow comparison to a third
specimen of interest (C). Fourth, this approach requires
very large amounts (hundreds of micrograms) of "driver"
mRNA (specimen B), which significantly limits the number
and type of subtractions that are possible since many
tissues and cells are very difficult to obtain in large
quantities.

Fifth, the resolution of the subtraction is dependent
upon the physical properties of DNA:DNA or RNA:DNA
hybridization. The ability of a given sequence to find a
hybridization match is dependent on its unique CoT value.
The CoT value is a function of the number of copies
(concentration) of the particular sequence, multiplied by
the time of hybridization. It follows that for sequences
which are abundant, hybridization events will occur very
rapidly (low CoT value), while rare sequences will form
duplexes at very high CoT values. CoT values which allow
such rare sequences to form duplexes and therefore be
effectively selected are difficult to achieve in a
convenient time frame. Therefore, hybridization
subtraction is simply not a useful technique with which to
study relative levels of rare mRNA species. Sixth, this
problem is further complicated by the fact that duplex
formation is also dependent on the nucleotide base
composition for a given sequence. Those sequences rich in
G + C form stronger duplexes than those with high contents
of A + T. Therefore, the former sequences will tend to be
removed selectively by hybridization subtraction. Seventh,
it is possible that hybridization between nonexact matches
can occur. When this happens, the expression of a
homologous gene may "mask" expression of a gene of
interest, artificially skewing the results for that
particular gene.
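
The dependence on CoT described in the fifth point can be made concrete with a small calculation. The sketch below assumes idealized second-order reassociation kinetics, C/C0 = 1 / (1 + k * C0 * t), which the text itself does not specify; the rate constant and the two concentrations are hypothetical values in arbitrary units.

```python
def fraction_single_stranded(c0, t, k=1.0):
    """Ideal second-order reassociation: C/C0 = 1 / (1 + k * C0 * t)."""
    return 1.0 / (1.0 + k * c0 * t)

def time_to_half(c0, k=1.0):
    """Time at which half of the strands have formed duplexes (C0 * t = 1 / k)."""
    return 1.0 / (k * c0)

# Hypothetical concentrations: the rare species is 1000-fold less abundant.
abundant, rare = 1e-3, 1e-6
ratio = time_to_half(rare) / time_to_half(abundant)
print(f"rare species needs about {ratio:.0f}x longer to reach half-hybridization")
```

Under this assumed model the time to half-hybridization scales inversely with concentration, which is why rare sequences are hard to select in a convenient time frame.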

Matsubara and Okubo proposed using partial cDNA
sequences to establish expression profiles of genes which
could be used in functional analyses of the human genome.
Matsubara and Okubo warned against using random priming, as
it creates multiple unique DNA fragments from individual
mRNAs and may thus skew the analysis of the number of
particular mRNAs per library. They sequenced randomly
selected members from a 3'-directed cDNA library and
established the frequency of appearance of the various
ESTs. They proposed comparing lists of ESTs from various
cell types to classify genes. Genes expressed in many
different cell types were labeled housekeepers and those
selectively expressed in certain cells were labeled
cell-specific genes, even in the absence of the full sequence of
the gene or the biological activity of the gene product.
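
The classification described above, comparing EST lists across cell types, can be sketched as follows. This is a hedged illustration and not the procedure of Matsubara and Okubo: the function name, the 80% threshold for "many different cell types", the "intermediate" label, and the example EST lists are all assumptions.

```python
def classify_genes(est_lists, min_fraction=0.8):
    """Label each EST by how many cell types it appears in.

    est_lists maps a cell type to the list of ESTs observed in it.
    """
    n_types = len(est_lists)
    seen_in = {}
    for cell_type, ests in est_lists.items():
        for est in set(ests):  # count each EST once per cell type
            seen_in.setdefault(est, set()).add(cell_type)
    labels = {}
    for est, types in seen_in.items():
        if len(types) / n_types >= min_fraction:
            labels[est] = "housekeeper"
        elif len(types) == 1:
            labels[est] = "cell-specific"
        else:
            labels[est] = "intermediate"
    return labels

# Hypothetical EST lists for three cell types.
est_lists = {
    "liver": ["actin", "gapdh", "albumin", "geneQ"],
    "brain": ["actin", "gapdh", "gfap", "geneQ"],
    "muscle": ["actin", "gapdh", "myosin"],
}
labels = classify_genes(est_lists)
print(labels["actin"], labels["albumin"], labels["geneQ"])
```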

The present invention avoids the drawbacks of the
prior art by providing a method to quantify the relative
abundance of multiple gene transcripts in a given
biological specimen by the use of high-throughput
sequence-specific analysis of individual RNAs and/or their
corresponding cDNAs.

The present invention offers several advantages over
current protein discovery methods which attempt to isolate
individual proteins based upon biological effects. The
method of the instant invention provides for detailed
diagnostic comparisons of cell profiles revealing numerous
changes in the expression of individual transcripts.

The instant invention provides several advantages over
current subtraction methods including a more complex
library analysis (10^6 to 10^7 clones as compared to 10^3
clones) which allows identification of low abundance
messages as well as enabling the identification of messages
which either increase or decrease in abundance. These
large libraries are very routine to make in contrast to the
libraries of previous methods. In addition, homologues can
easily be distinguished with the method of the instant
invention.

This method is very convenient because it organizes a
large quantity of data into a comprehensible, digestible
format. The most significant differences are highlighted
by electronic subtraction. In-depth analyses are made more
convenient.

The present invention provides several advantages over
previous methods of electronic analysis of cDNA. The
method is particularly powerful when more than 100 and
preferably more than 1,000 gene transcripts are analyzed.
In such a case, new low-frequency transcripts are
discovered and tissue typed.

High resolution analysis of gene expression can be
used directly as a diagnostic profile or to identify
disease-specific genes for the development of more classic
diagnostic approaches.

This process is defined as gene transcript frequency 
analysis. The resulting quantitative analysis of the gene 
transcripts is defined as comparative gene transcript 
analysis. 

3. SUMMARY OF THE INVENTION

The invention is a method of analyzing a specimen 
containing gene transcripts comprising the steps of (a) 
producing a library of biological sequences; (b) generating 
a set of transcript sequences, where each of the transcript
sequences in said set is indicative of a different one of
the biological sequences of the library; (c) processing the
transcript sequences in a programmed computer (in which a
database of reference transcript sequences indicative of
reference sequences is stored), to generate an identified
sequence value for each of the transcript sequences, where
each said identified sequence value is indicative of
sequence annotation and a degree of match between one of
the biological sequences of the library and at least one of
the reference sequences; and (d) processing each said
identified sequence value to generate final data values
indicative of the number of times each identified sequence
value is present in the library.

The invention also includes a method of comparing two 
specimens containing gene transcripts. The first specimen
is processed as described above. The second specimen is
used to produce a second library of biological sequences, 
which is used to generate a second set of transcript 
sequences, where each of the transcript sequences in the

In a further embodiment, the relative abundance of the
gene transcripts in one cell type or tissue is compared 
with the relative abundance of gene transcript numbers in a 
second cell type or tissue in order to identify the 
differences and similarities.

In a further embodiment, the method includes a system 
for analyzing a library of biological sequences including a 
means for receiving a set of transcript sequences, where 
each of the transcript sequences is indicative of a
different one of the biological sequences of the library;
and a means for processing the transcript sequences in a
computer system in which a database of reference transcript
sequences indicative of reference sequences is stored,
wherein the computer is programmed with software for
generating an identified sequence value for each of the
transcript sequences, where each said identified sequence
value is indicative of a sequence annotation and the degree
of match between a different one of the biological
sequences of the library and at least one of the reference
sequences, and for processing each said identified sequence
value to generate final data values indicative of the 
number of times each identified sequence value is present 
in the library. 

In essence, the invention is a method and system for
quantifying the relative abundance of gene transcripts in a
biological specimen. The invention provides a method for
comparing the gene transcript image from two or more
different biological specimens in order to distinguish
between the two specimens and identify one or more genes
which are differentially expressed between the two
specimens. Thus, this gene transcript image and its
comparison can be used as a diagnostic. One embodiment of
the method generates high-throughput sequence-specific
analysis of multiple RNAs or their corresponding cDNAs: a
gene transcript image. Another embodiment of the method
produces the gene transcript imaging analysis by the use of
high-throughput cDNA sequence analysis. In addition, two
or more gene transcript images can be compared and used to
detect or diagnose a particular biological state, disease,
or condition which is correlated to the relative abundance
of gene transcripts in a given cell or population of cells.

4. DESCRIPTION OF THE TABLES AND DRAWINGS 

4.1. TABLES 

Table 1 presents a detailed explanation of the letter
codes utilized in Tables 2-5.

Table 2 lists the one hundred most common gene
transcripts. It is a partial list of isolates from the
HUVEC cDNA library prepared and sequenced as described
below. The left-hand column refers to the sequence's order
of abundance in this table. The next column labeled
"number" is the clone number of the first HUVEC sequence
identification reference matching the sequence in the
"entry" column number. Isolates that have not been
sequenced are not present in Table 2. The next column,
labeled "N", indicates the total number of cDNAs which have
the same degree of match with the sequence of the reference
transcript in the "entry" column.

The column labeled "entry" gives the NIH GENBANK locus
name, which corresponds to the library sequence numbers.
The "s" column indicates in a few cases the species of the
reference sequence. The code for column "s" is given in
Table 1. The column labeled "descriptor" provides a plain
English explanation of the identity of the sequence
corresponding to the NIH GENBANK locus name in the "entry"
column.

Table 3 is a comparison of the top fifteen most 
abundant gene transcripts in normal monocytes and activated 
macrophage cells. 

Table 4 is a detailed summary of the library subtraction
analysis comparing the THP-1 and human macrophage
cDNA sequences. In Table 4, the same code as in Table 2 is
used. Additional columns are for "bgfreq" (abundance
number in the subtractant library), "rfend" (abundance
number in the target library) and "ratio" (the target
abundance number divided by the subtractant abundance
number). As is clear from perusal of the table, when the
abundance number in the subtractant library is "0", the
target abundance number is divided by 0.05. This is a way
of obtaining a result (not possible dividing by 0) and 
distinguishing the result from ratios of subtractant 
numbers of 1. 
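
The ratio rule described for Table 4 can be sketched as follows. This is a minimal illustration only, not the program of Table 5; the function name `subtraction_ratio`, the parameter `zero_substitute` and the sample rows are hypothetical, and only the column names "bgfreq" and "rfend" and the value 0.05 come from the text.

```python
def subtraction_ratio(target_abundance, subtractant_abundance, zero_substitute=0.05):
    """Return target abundance divided by subtractant abundance.

    When the subtractant abundance is 0, divide by zero_substitute (0.05)
    instead, so that a ratio is always obtained and stays distinguishable
    from the ratio produced by a subtractant abundance of 1.
    """
    if subtractant_abundance > 0:
        denominator = subtractant_abundance
    else:
        denominator = zero_substitute
    return target_abundance / denominator


# Hypothetical rows: (entry, bgfreq, rfend)
rows = [("entry_a", 0, 3), ("entry_b", 1, 3), ("entry_c", 4, 2)]
for entry, bgfreq, rfend in rows:
    print(entry, bgfreq, rfend, subtraction_ratio(rfend, bgfreq))
```

With these rows, the entry absent from the subtractant library receives a much larger ratio than the entry present once, which is the distinction the substitution is meant to preserve.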

Table 5 is the computer program, written in source
code, for generating gene transcript subtraction profiles.

Table 6 is a partial listing of database entries used 
in the electronic northern blot analysis as provided by the
present invention. 

4.2. BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a chart summarizing data collected and 
stored regarding the library construction portion of 
sequence preparation and analysis. 

Figure 2 is a diagram representing the sequence of
operations performed by "abundance sort" software in a
class of preferred embodiments of the inventive method. 

Figure 3 is a block diagram of a preferred embodiment 
of the system of the invention. 

Figure 4 is a more detailed block diagram of the
bioinformatics process from new sequence (that has already
been sequenced but not identified) to printout of the 
transcript imaging analysis and the provision of database 
subscriptions. 

5. DETAILED DESCRIPTION OF THE INVENTION

The present invention provides a method to compare the 
relative abundance of gene transcripts in different 
biological specimens by the use of high-throughput 
sequence-specific analysis of individual RNAs or their
corresponding cDNAs (or alternatively, of data representing
other biological sequences). This process is denoted
herein as gene transcript imaging. The quantitative
analysis of the relative abundance for a set of gene
transcripts is denoted herein as "gene transcript image
analysis" or "gene transcript frequency analysis". The
present invention allows one to obtain a profile for gene
transcription in any given population of cells or tissue
from any type of organism. The invention can be applied to
obtain a profile of a specimen consisting of a single cell
(or clones of a single cell), or of many cells, or of
tissue more complex than a single cell and containing
multiple cell types, such as liver.

The invention has significant advantages in the fields
of diagnostics, toxicology and pharmacology, to name a few.
A highly sophisticated diagnostic test can be performed on 
the ill patient in whom a diagnosis has not been made. A 
biological specimen consisting of the patient's fluids or
tissues is obtained, and the gene transcripts are isolated
and expanded to the extent necessary to determine their
identity. Optionally, the gene transcripts can be
converted to cDNA. A sampling of the gene transcripts is
subjected to sequence-specific analysis and quantified.
These gene transcript sequence abundances are compared
against reference database sequence abundances including
normal data sets for diseased and healthy patients. The
patient has the disease(s) with which the patient's data
set most closely correlates.
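
One way to picture the comparison of a patient's data set against reference data sets is sketched below. The text does not specify how correlation is measured; the use of the Pearson correlation coefficient here is an assumption, as are the function names `pearson` and `closest_reference`, the transcript labels and all abundance values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def closest_reference(patient, references):
    """Return (label, scores) for the reference profile that best matches.

    patient: dict mapping transcript label -> relative abundance.
    references: dict mapping data-set label -> dict of the same form.
    Transcripts missing from a profile are treated as abundance 0.
    """
    genes = sorted(set(patient).union(*references.values()))
    patient_vec = [patient.get(g, 0.0) for g in genes]
    scores = {}
    for label, profile in references.items():
        ref_vec = [profile.get(g, 0.0) for g in genes]
        scores[label] = pearson(patient_vec, ref_vec)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical relative abundances, for illustration only.
patient = {"transcript_1": 0.30, "transcript_2": 0.50, "transcript_3": 0.20}
references = {
    "healthy": {"transcript_1": 0.02, "transcript_2": 0.90, "transcript_3": 0.08},
    "diseased": {"transcript_1": 0.28, "transcript_2": 0.52, "transcript_3": 0.20},
}
best, scores = closest_reference(patient, references)
print(best, scores)
```

Any other similarity measure could be substituted for `pearson` without changing the overall shape of the comparison.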

For example, gene transcript frequency analysis can be
used to differentiate normal cells or tissues from diseased
cells or tissues, just as it highlights differences between 
normal monocytes and activated macrophages in Table 3. 

In toxicology, a fundamental question is which tests
are most effective in predicting or detecting a toxic
effect. Gene transcript imaging provides highly detailed
information on the cell and tissue environment, some of
which would not be obvious in conventional, less detailed
screening methods. The gene transcript image is a more
powerful method to predict drug toxicity and efficacy.
Similar benefits accrue in the use of this tool in
pharmacology. The gene transcript image can be used
selectively to look at protein categories which are
expected to be affected, for example, enzymes which
detoxify toxins.

In an alternative embodiment, comparative gene 
transcript frequency analysis is used to differentiate 
between cancer cells which respond to anti-cancer agents 
and those which do not respond. Examples of anti-cancer
agents are tamoxifen, vincristine, vinblastine,
podophyllotoxins, etoposide, teniposide, cisplatin,
biologic response modifiers such as interferon, IL-2,
GM-CSF, enzymes, hormones and the like. This method also
provides a means for sorting the gene transcripts by
functional category. In the case of cancer cells, 
transcription factors or other essential regulatory 
molecules are very important categories to analyze across 
different libraries. 

In yet another embodiment, comparative gene transcript
frequency analysis is used to differentiate between control
liver cells and liver cells isolated from patients treated
with experimental drugs like FIAU to distinguish between
pathology caused by the underlying disease and that caused
by the drug.

In yet another embodiment, comparative gene transcript 
frequency analysis is used to differentiate between brain 
tissue from patients treated and untreated with lithium.

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between
cyclosporin and FK506-treated cells and normal cells.

In a further embodiment, comparative gene transcript 
frequency analysis is used to differentiate between virally 
infected (including HIV-infected) human cells and
uninfected human cells. Gene transcript frequency analysis
is also used to rapidly survey gene transcripts in
HIV-resistant, HIV-infected, and HIV-sensitive cells.
Comparison of gene transcript abundance will indicate the
success of treatment and/or new avenues to study.

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between
bronchial lavage fluids from healthy and unhealthy patients 
with a variety of ailments. 

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between cell,
plant, microbial and animal mutants and wild-type species.
In addition, the transcript abundance program is adapted to
permit the scientist to evaluate the transcription of one
gene in many different tissues. Such comparisons could
identify deletion mutants which do not produce a gene
product and point mutants which produce a less abundant or 
otherwise different message. Such mutations can affect 
basic biochemical and pharmacological processes, such as 
mineral nutrition and metabolism, and can be isolated by
means known to those skilled in the art. Thus, crops with 
improved yields, pest resistance and other factors can be 
developed. 

In a further embodiment, comparative gene transcript
frequency analysis is used for an interspecies comparative
analysis which would allow for the selection of better
pharmacologic animal models. In this embodiment, humans
and other animals (such as a mouse), or their cultured
cells, are treated with a specific test agent. The relative
sequence abundance of each cDNA population is determined.
If the animal test system is a good model, homologous genes
in the animal cDNA population should change expression
similarly to those in human cells. If side effects are
detected with the drug, a detailed transcript abundance
analysis will be performed to survey gene transcript
changes. Models will then be evaluated by comparing basic
physiological changes.

In a further embodiment, comparative gene transcript 
frequency analysis is used in a clinical setting to give a
highly detailed gene transcript profile of a patient's
cells or tissue (for example, a blood sample). In 
particular, gene transcript frequency analysis is used to 
give a high resolution gene expression profile of a 
diseased state or condition. 

In the preferred embodiment, the method utilizes
high-throughput cDNA sequencing to identify specific
transcripts of interest. The generated cDNA and deduced
amino acid sequences are then extensively compared with
GENBANK and other sequence data banks as described below.
The method offers several advantages over current protein
discovery by two-dimensional gel methods which try to
identify individual proteins involved in a particular
biological effect. Here, detailed comparisons of profiles
of activated and inactive cells reveal numerous changes in
the expression of individual transcripts. After it is
determined whether the sequence is an "exact" match, a similar
match or a non-match, the sequence is entered into a database.
Next, the numbers of copies of cDNA corresponding to each
gene are tabulated. Although this can be done slowly and
arduously, if at all, by human hand from a printout of all
entries, a computer program is a useful and rapid way to
tabulate this information. The numbers of cDNA copies
(optionally divided by the total number of sequences in the
data set) provide a picture of the relative abundance of
transcripts for each corresponding gene. The list of
represented genes can then be sorted by abundance in the
cDNA population. A multitude of additional types of
comparisons or dimensions are possible and are exemplified
below.
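
The tabulation and sorting just described can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the program of Table 5: the function name `abundance_table`, the `normalize` flag and the entry labels are hypothetical, and each sequenced clone is assumed to have already been assigned a single reference entry.

```python
from collections import Counter


def abundance_table(identified_entries, normalize=False):
    """Tabulate how often each identified entry occurs and sort by abundance.

    identified_entries: one label per sequenced clone (hypothetical labels).
    normalize: if True, divide each count by the total number of sequences
    in the data set, giving relative abundance.
    Returns a list of (entry, abundance) pairs, most abundant first; ties
    are broken alphabetically so the order is reproducible.
    """
    counts = Counter(identified_entries)
    total = len(identified_entries)
    table = []
    for entry, count in counts.items():
        value = count / total if normalize else count
        table.append((entry, value))
    table.sort(key=lambda row: (-row[1], row[0]))
    return table


# Hypothetical identifications for six clones.
clones = ["entry_x", "entry_y", "entry_x", "entry_z", "entry_x", "entry_y"]
print(abundance_table(clones))
print(abundance_table(clones, normalize=True))
```

Normalizing by the data set size is what allows libraries of different sizes to be compared on a common scale.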

An alternate method of producing a gene transcript 
image includes the steps of obtaining a mixture of test 
mRNA and providing a representative array of unique probes 
whose sequences are complementary to at least some of the
test mRNAs. Next, a fixed amount of the test mRNA is added
to the arrayed probes. The test mRNA is incubated with the
probes for a sufficient time to allow hybrids of the test
mRNA and probes to form. The mRNA-probe hybrids are
detected and the quantity determined. The hybrids are
identified by their location in the probe array. The
quantity of each hybrid is summed to give a population 
number. Each hybrid quantity is divided by the population 
number to provide a set of relative abundance data termed a 
gene transcript image analysis. 
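
The arithmetic of the hybridization-based alternative (sum the hybrid quantities to a population number, then divide each quantity by that number) can be sketched as below. The function name `transcript_image_from_array`, the array position labels and the signal values are hypothetical; how hybrid quantities are measured is outside the scope of this sketch.

```python
def transcript_image_from_array(hybrid_quantities):
    """Convert per-probe hybrid quantities into relative abundances.

    hybrid_quantities: dict mapping probe array location -> detected quantity.
    The population number is the sum of all quantities; each quantity is
    divided by it.
    """
    population_number = sum(hybrid_quantities.values())
    if population_number == 0:
        raise ValueError("no hybrids detected; relative abundance is undefined")
    return {
        location: quantity / population_number
        for location, quantity in hybrid_quantities.items()
    }


# Hypothetical detected quantities at three array locations.
signals = {"A1": 50.0, "A2": 30.0, "B1": 20.0}
image = transcript_image_from_array(signals)
print(image)
```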

6. EXAMPLES

The examples below are provided to illustrate the 
subject invention. These examples are provided by way of 
illustration and are not included for the purpose of 
limiting the invention. 

6.1. TISSUE SOURCES AND CELL LINES

For analysis with the computer program claimed herein, 
biological sequences can be obtained from virtually any
source. Most popular are tissues obtained from the human
body. Tissues can be obtained from any organ of the body, 
any age donor, any abnormality or any immortalized cell 
line. Immortal cell lines may be preferred in some
instances because of their purity of cell type; other
tissue samples invariably include mixed cell types. A
special technique is available to take a single cell (for
example, a brain cell) and harness the cellular machinery
to grow up sufficient cDNA for sequencing by the techniques
and analysis described herein (cf. U.S. Patent Nos.
5,021,335 and 5,168,038, which are incorporated by
reference). The examples given herein utilized the
following immortalized cell lines: monocyte-like U-937
cells, activated macrophage-like THP-1 cells, induced
vascular endothelial cells (HUVEC cells) and mast cell-like
HMC-1 cells.

The U-937 cell line is a human histiocytic lymphoma
cell line with monocyte characteristics, established from
malignant cells obtained from the pleural effusion of a
patient with diffuse histiocytic lymphoma (Sundstrom, C.
and Nilsson, K. (1976) Int. J. Cancer 17:565). U-937 is
one of only a few human cell lines with the morphology,
cytochemistry, surface receptors and monocyte-like
characteristics of histiocytic cells. These cells can be
induced to terminal monocytic differentiation and will
express new cell surface molecules when activated with
supernatants from human mixed lymphocyte cultures. Upon
this type of in vitro activation, the cells undergo
morphological and functional changes, including
augmentation of antibody-dependent cellular cytotoxicity
(ADCC) against erythroid and tumor target cells (one of the
principal functions of macrophages). Activation of U-937
cells with phorbol 12-myristate 13-acetate (PMA) in vitro
stimulates the production of several compounds, including
prostaglandins, leukotrienes and platelet-activating factor
(PAF), which are potent inflammatory mediators. Thus,
U-937 is a cell line that is well suited for the
identification and isolation of gene transcripts associated
with normal monocytes.

The HUVEC cell line is a normal, homogeneous, well
characterized, early passage endothelial cell culture from 
human umbilical vein (Cell Systems Corp., 12815 NE 124th 
Street, Kirkland, WA 98034). Only gene transcripts from
induced, or treated, HUVEC cells were sequenced. One batch
of 1 x 10^8 cells was treated for 5 hours with 1 U/ml rIL-1b
and 100 ng/ml E. coli lipopolysaccharide (LPS) endotoxin
prior to harvesting. A separate batch of 2 x 10^8 cells was
treated at confluence with 4 U/ml TNF and 2 U/ml
interferon-gamma (IFN-gamma) prior to harvesting.

THP-1 is a human leukemic cell line with distinct 
monocytic characteristics. This cell line was derived from 
the blood of a 1-year-old boy with acute monocytic leukemia 
(Tsuchiya, S. et al. (1980) Int. J. Cancer: 171-76). The
following cytological and cytochemical criteria were used
to determine the monocytic nature of the cell line: 1) the
presence of alpha-naphthyl butyrate esterase activity which
could be inhibited by sodium fluoride; 2) the production of
lysozyme; 3) the phagocytosis of latex particles and
sensitized SRBC (sheep red blood cells); and 4) the ability
of mitomycin C-treated THP-1 cells to activate
T-lymphocytes following ConA (concanavalin A) treatment.
Morphologically, the cytoplasm contained small azurophilic
granules and the nucleus was indented and irregularly
shaped with deep folds. The cell line had Fc and C3b
receptors, probably functioning in phagocytosis. THP-1
cells treated with the tumor promoter
12-O-tetradecanoyl-phorbol-13-acetate (TPA) stop proliferating and
differentiate into macrophage-like cells which mimic native
monocyte-derived macrophages in several respects.
Morphologically, as the cells change shape, the nucleus
becomes more irregular and additional phagocytic vacuoles
appear in the cytoplasm. The differentiated THP-1 cells
also exhibit an increased adherence to tissue culture
plastic.

HMC-1 cells (a human mast cell line) were established 
from the peripheral blood of a Mayo Clinic patient with 
mast cell leukemia (Leukemia Res. (1988) 12:345-55). The 
cultured cells looked similar to immature cloned murine 




WO 95/20681 



PCT/US95/01160 



mast cells, contained histamine, and stained positively for
chloroacetate esterase, amino caproate esterase, eosinophil
major basic protein (MBP) and tryptase. The HMC-1 cells
have, however, lost the ability to synthesize normal IgE
receptors. HMC-1 cells also possess a 10;16 translocation,
present in cells initially collected by leukapheresis from
the patient and not an artifact of culturing. Thus, HMC-1
cells are a good model for mast cells.

6.2. CONSTRUCTION OF cDNA LIBRARIES

For inter-library comparisons, the libraries must be
prepared in similar manners. Certain parameters appear to
be particularly important to control. One such parameter
is the method of isolating mRNA. It is important to use
the same conditions to remove DNA and heterogeneous nuclear
RNA from comparison libraries. Size fractionation of cDNA
must be carefully controlled. The same vector preferably
should be used for preparing libraries to be compared. At
the very least, the same type of vector (e.g.,
unidirectional vector) should be used to assure a valid
comparison. A unidirectional vector may be preferred in
order to more easily analyze the output.

It is preferred to prime only with oligo dT
unidirectional primer in order to obtain only one clone per
mRNA transcript when obtaining cDNAs. However, it is
recognized that employing a mixture of oligo dT and random
primers can also be advantageous because such a mixture
results in more sequence diversity when gene discovery also
is a goal. Similar effects can be obtained with DR2
(Clontech) and HXLOX (US Biochemical) and also vectors from
Invitrogen and Novagen. These vectors have two
requirements. First, there must be primer sites for
commercially available primers such as T3 or M13 reverse
primers. Second, the vector must accept inserts up to 10
kb.

It also is important that the clones be randomly
sampled, and that a significant population of clones is
used. Data have been generated with 5,000 clones; however,
if very rare genes are to be obtained and/or their relative
abundance determined, as many as 100,000 clones from a
single library may need to be sampled. Size fractionation 
of cDNA also must be carefully controlled. Alternately, 
plaques can be selected, rather than clones. 
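
The relationship between sampling depth and the chance of observing a rare transcript can be illustrated with a simple calculation. The sketch below assumes that clones are drawn randomly and independently, so that a transcript present at frequency f is missed by N clones with probability (1 - f)^N; this model, the function names and the example frequency of 1 in 100,000 are assumptions for illustration and are not taken from the text.

```python
import math


def detection_probability(frequency, clones_sampled):
    """Probability of seeing at least one clone of a transcript.

    Assumes independent random sampling, with the transcript making up
    the given fraction (frequency) of the library.
    """
    return 1.0 - (1.0 - frequency) ** clones_sampled


def clones_needed(frequency, confidence=0.95):
    """Smallest number of clones giving at least the stated detection probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - frequency))


rare = 1e-5  # hypothetical transcript: 1 clone in 100,000
for n in (5_000, 100_000):
    print(n, round(detection_probability(rare, n), 3))
print(clones_needed(rare))
```

Under these assumptions, a transcript at 1 in 100,000 has roughly a 5% chance of appearing among 5,000 clones and roughly a 63% chance among 100,000 clones, which is consistent with the need for deeper sampling when rare genes are of interest.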

Besides the Uni-ZAP™ vector system by Stratagene
disclosed below, it is now believed that other similarly
unidirectional vectors also can be used. For example, it
is believed that such vectors include but are not limited
to DR2 (Clontech) and HXLOX (U.S. Biochemical).

Preferably, the details of library construction (as
shown in Figure 1) are collected and stored in a database
for later retrieval relative to the sequences being
compared. Fig. 1 shows important information regarding the
library collaborator or cell or cDNA supplier,
pretreatment, biological source, culture, mRNA preparation
and cDNA construction. Similarly detailed information
about the other steps is beneficial in analyzing sequences
and libraries in depth.

RNA must be harvested from cells and tissue samples
and cDNA libraries are subsequently constructed. cDNA
libraries can be constructed according to techniques known
in the art. (See, for example, Maniatis, T. et al. (1982)
Molecular Cloning, Cold Spring Harbor Laboratory, New
York). cDNA libraries may also be purchased. The U-937
cDNA library (catalog No. 937207) was obtained from
Stratagene, Inc., 11099 N. Torrey Pines Rd., La Jolla, CA
92037.

The THP-1 cDNA library was custom constructed by
Stratagene from THP-1 cells cultured 48 hours with 100 nM
TPA and 4 hours with 1 µg/ml LPS. The human mast cell
HMC-1 cDNA library was also custom constructed by Stratagene
from cultured HMC-1 cells. The HUVEC cDNA library was
custom constructed by Stratagene from two batches of
induced HUVEC cells which were separately processed.

Essentially, all the libraries were prepared in the
same manner. First, poly(A+) RNA (mRNA) was purified. For
the U-937 and HMC-1 RNA, cDNA synthesis was only primed
with oligo dT. For the THP-1 and HUVEC RNA, cDNA synthesis
was primed separately with both oligo dT and random
hexamers, and the two cDNA libraries were treated
separately. Synthetic adaptor oligonucleotides were
ligated onto cDNA ends enabling its insertion into the
Uni-ZAP™ vector system (Stratagene), allowing high efficiency
unidirectional (sense orientation) lambda library
construction and the convenience of a plasmid system with
blue-white color selection to detect clones with cDNA
insertions. Finally, the two libraries were combined into
a single library by mixing equal numbers of bacteriophage.

The libraries can be screened with either DNA probes
or antibody probes and the pBluescript® phagemid
(Stratagene) can be rapidly excised in vivo. The phagemid
allows the use of a plasmid system for easy insert
characterization, sequencing, site-directed mutagenesis,
the creation of unidirectional deletions and expression of
fusion proteins. The custom-constructed library phage
particles were infected into E. coli host strain XL1-Blue®
(Stratagene), which has a high transformation efficiency,
increasing the probability of obtaining rare,
under-represented clones in the cDNA library.

6.3. ISOLATION OF cDNA CLONES 

The phagemid forms of individual cDNA clones were
obtained by the in vivo excision process, in which the host
bacterial strain was coinfected with both the lambda
library phage and an f1 helper phage. Proteins derived
from both the library-containing phage and the helper phage
nicked the lambda DNA, initiated new DNA synthesis from
defined sequences on the lambda target DNA and created a
smaller, single stranded circular phagemid DNA molecule
that included all DNA sequences of the pBluescript® plasmid
and the cDNA insert. The phagemid DNA was secreted from
the cells and purified, then used to re-infect fresh host
cells, where the double stranded phagemid DNA was produced.
Because the phagemid carries the gene for beta-lactamase,
the newly-transformed bacteria are selected on medium
containing ampicillin.

Phagemid DNA was purified using the Magic Minipreps™
DNA Purification System (Promega catalogue #A7100; Promega
Corp., 2800 Woods Hollow Rd., Madison, WI 53711). This
small-scale process provides a simple and reliable method
for lysing the bacterial cells and rapidly isolating
purified phagemid DNA using a proprietary DNA-binding
resin. The DNA was eluted from the purification resin
already prepared for DNA sequencing and other analytical
manipulations.

Phagemid DNA was also purified using the QIAwell-8
Plasmid Purification System from QIAGEN® DNA Purification
System (QIAGEN Inc., 9259 Eton Ave., Chatsworth, CA
91311). This product line provides a convenient, rapid and
reliable high-throughput method for lysing the bacterial
cells and isolating highly purified phagemid DNA using
QIAGEN anion-exchange resin particles with EMPORE™ membrane
technology from 3M in a multiwell format. The DNA was
eluted from the purification resin already prepared for DNA
sequencing and other analytical manipulations.

An alternate method of purifying phagemid has recently
become available. It utilizes the Miniprep Kit (Catalog
No. 77468, available from Advanced Genetic Technologies
Corp., 19212 Orbit Drive, Gaithersburg, Maryland). This
kit is in the 96-well format and provides enough reagents
for 960 purifications. Each kit is provided with a
recommended protocol, which has been employed except for
the following changes. First, the 96 wells are each filled
with only 1 ml of sterile terrific broth with carbenicillin
at 25 mg/L and glycerol at 0.4%. After the wells are
inoculated, the bacteria are cultured for 24 hours and
lysed with 60 µl of lysis buffer. A centrifugation step
(2900 rpm for 5 minutes) is performed before the contents
of the block are added to the primary filter plate. The
optional step of adding isopropanol to TRIS buffer is not
routinely performed. After the last step in the protocol,
samples are transferred to a Beckman 96-well block for
storage.

Another new DNA purification system is the WIZARD™ 
product line which is available from Promega (catalog No. 
A7071) and may be adaptable to the 96-well format.

6.4. SEQUENCING OF cDNA CLONES

The cDNA inserts from random isolates of the U-937 and
THP-1 libraries were sequenced in part. Methods for DNA
sequencing are well known in the art. Conventional
enzymatic methods employ DNA polymerase Klenow fragment,
Sequenase™ or Taq polymerase to extend DNA chains from an
oligonucleotide primer annealed to the DNA template of
interest. Methods have been developed for the use of both
single- and double-stranded templates. The chain
termination reaction products are usually electrophoresed
on urea-acrylamide gels and are detected either by
autoradiography (for radionuclide-labeled precursors) or by
fluorescence (for fluorescent-labeled precursors). Recent
improvements in mechanized reaction preparation, sequencing
and analysis using the fluorescent detection method have
permitted expansion in the number of sequences that can be
determined per day (such as the Applied Biosystems 373 and
377 DNA sequencer, Catalyst 800). Currently with the
system as described, read lengths range from 250 to 400
bases and are clone dependent. Read length also varies
with the length of time the gel is run. In general, the
shorter runs tend to truncate the sequence. A minimum of
only about 25 to 50 bases is necessary to establish the
identification and degree of homology of the sequence.

Gene transcript imaging can be used with any
sequence-specific method, including, but not limited to,
hybridization, mass spectroscopy, capillary electrophoresis
and SDS gel electrophoresis.

6.5. HOMOLOGY SEARCHING OF cDNA CLONE AND
DEDUCED PROTEIN (and Subsequent Steps)

Using the nucleotide sequences derived from the cDNA
clones as query sequences (sequences of a Sequence
Listing), databases containing previously identified
sequences are searched for areas of homology (similarity).
Examples of such databases include Genbank and EMBL. We
next describe examples of two homology search algorithms
that can be used, and then describe the subsequent
computer-implemented steps to be performed in accordance
with preferred embodiments of the invention.

In the following description of the
computer-implemented steps of the invention, the word "library"
denotes a set (or population) of biological specimen
nucleic acid sequences. A "library" can consist of cDNA
sequences, RNA sequences, or the like, which characterize a
biological specimen. The biological specimen can consist
of cells of a single human cell type (or can be any of the
other above-mentioned types of specimens). We contemplate
that the sequences in a library have been determined so as
to accurately represent or characterize a biological
specimen (for example, they can consist of representative
cDNA sequences from clones of RNA taken from a single human
cell).

In the following description of the computer-implemented
steps of the invention, the expression "database" denotes a
set of stored data which represent a collection of
sequences, which in turn represent a collection of
biological reference materials. For example, a database can
consist of data representing many stored cDNA sequences
which are in turn representative of human cells infected
with various viruses, cells of humans of various ages, cells
from different mammalian species, and so on.

In preferred embodiments, the invention employs a
computer programmed with software (to be described) for
performing the following steps:

(a) processing data indicative of a library of cDNA
sequences (generated as a result of high-throughput cDNA
sequencing or other method) to determine whether each
sequence in the library matches a DNA sequence of a
reference database of DNA sequences (and if so, identifying
the reference database entry which matches the sequence and
indicating the degree of match between the reference
sequence and the library sequence) and assigning an
identified sequence value based on the sequence annotation
and degree of match to each of the sequences in the
library;

(b) for some or all entries of the database,
tabulating the number of matching identified sequence
values in the library (although this can be done by human
hand from a printout of all entries, we prefer to perform
this step using computer software to be described below),
thereby generating a set of final data values or "abundance
numbers"; and

(c) if the libraries are different sizes, dividing
each abundance number by the total number of sequences in
the library, to obtain a relative abundance number for each
identified sequence value (i.e., a relative abundance of
each gene transcript).
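
Steps (b) and (c) amount to counting and dividing. The following is a minimal Python sketch of that arithmetic, not the FoxBASE program of Table 5; the function names and the example gene labels are hypothetical, and the input is assumed to be the list of identified sequence values produced by step (a), one per sequenced clone.

```python
from collections import Counter

def abundance_numbers(identified_values):
    """Step (b): count how many library sequences carry each identified value."""
    return Counter(identified_values)

def relative_abundance(counts, library_size):
    """Step (c): divide each abundance number by the number of sequences in the library."""
    return {value: n / library_size for value, n in counts.items()}

# Hypothetical output of step (a): one identified sequence value per clone.
library = ["actin", "EF-1 alpha", "actin", "ferritin", "actin", "ferritin"]
counts = abundance_numbers(library)
relative = relative_abundance(counts, len(library))

# List the identified sequence values in order of decreasing abundance.
for value, n in counts.most_common():
    print(f"{value}\t{n}\t{relative[value]:.3f}")
```

Dividing by the library size is what makes abundance numbers from libraries of different sizes comparable.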

The list of identified sequence values (or genes
corresponding thereto) can then be sorted by abundance in
the cDNA population. A multitude of additional types of
comparisons or dimensions are possible.

For example (to be described below in greater detail),
steps (a) and (b) can be repeated for two different
libraries (sometimes referred to as a "target" library and
a "subtractant" library). Then, for each identified
sequence value (or gene transcript), a "ratio" value is
obtained by dividing the abundance number (for that
identified sequence value) for the target library, by the
abundance number (for that identified sequence value) for
the subtractant library.

In fact, subtraction may be carried out on multiple
libraries. It is possible to add the transcripts from
several libraries (for example, three) and then to divide
them by another set of transcripts from multiple libraries
(again, for example, three). Notation for this operation
may be abbreviated as (A+B+C)/(D+E+F), where the capital
letters each indicate an entire library. Optionally, the
abundance numbers of transcripts in the summed libraries
may be divided by the total sample size before subtraction.
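
The (A+B+C)/(D+E+F) operation can be illustrated with a short Python sketch. It assumes each library is already reduced to a table of abundance numbers keyed by identified sequence value; the function name `pooled_ratio` and the `normalize` flag are hypothetical. Transcripts absent from either pooled set are simply skipped in this sketch.

```python
from collections import Counter

def pooled_ratio(numerator_libs, denominator_libs, normalize=True):
    """Sum the abundance numbers of each group of libraries, then divide.

    With normalize=True, each pooled abundance is first divided by the
    total sample size of its pool, so pools of different sizes compare fairly.
    """
    num = sum(numerator_libs, Counter())    # A + B + C
    den = sum(denominator_libs, Counter())  # D + E + F
    num_total = sum(num.values()) if normalize else 1
    den_total = sum(den.values()) if normalize else 1
    return {
        gene: (num[gene] / num_total) / (den[gene] / den_total)
        for gene in num.keys() & den.keys()
    }

# Hypothetical abundance numbers for four small libraries.
A, B = Counter(x=2, y=1), Counter(x=1)
D, E = Counter(x=2, y=4), Counter(y=2)
print(pooled_ratio([A, B], [D, E]))
print(pooled_ratio([A, B], [D, E], normalize=False))
```

In the example the numerator pool holds 4 sequences and the denominator pool 8, so the normalized ratio for transcript x is twice the raw ratio.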

Unlike standard hybridization technology, which permits
a single subtraction of two libraries, once one has
processed a set or library of transcript sequences and stored
them in the computer, any number of subtractions can be
performed on the library. For example, by this method,
ratio values can be obtained by dividing relative abundance
values in a first library by corresponding values in a
second library and vice versa.

WO 95/20681 PCT/US95/01160

In variations on step (a), the library consists of
nucleotide sequences derived from cDNA clones. Examples of
databases which can be searched for areas of homology
(similarity) in step (a) include the commercially available
databases known as Genbank (NIH), EMBL (European Molecular
Biology Labs, Germany), and GENESEQ (Intelligenetics,
Mountain View, California).

One homology search algorithm which can be used to
implement step (a) is the algorithm described in the paper
by D.J. Lipman and W.R. Pearson, entitled "Rapid and
Sensitive Protein Similarity Searches," Science 227:1435
(1985). In this algorithm, the homologous regions are
searched in a two-step manner. In the first step, the
highest homologous regions are determined by calculating a
matching score using a homology score table. The parameter
"Ktup" is used in this step to establish the minimum window
size to be shifted for comparing two sequences. Ktup also
sets the number of bases that must match to extract the
highest homologous region among the sequences. In this
step, no insertions or deletions are applied and the
homology is displayed as an initial (INIT) value.

In the second step, the homologous regions are aligned
to obtain the highest matching score by inserting a gap in
order to add a probable deleted portion. The matching
score obtained in the first step is recalculated using the
homology score table and the insertion score table to an
optimized (OPT) value in the final output.
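
The role of Ktup in the first, ungapped step can be shown with a simplified Python sketch. This is an illustration of word-based diagonal scoring under our own assumptions, not the published Lipman-Pearson implementation: it counts shared words of length ktup instead of applying a homology score table, and the function name is hypothetical.

```python
from collections import defaultdict

def best_diagonal(query, subject, ktup=4):
    """Find the ungapped diagonal sharing the most ktup-length words.

    Returns (offset, hits), where offset = subject position - query position,
    or (None, 0) if the sequences share no word of length ktup.
    """
    # Index every ktup-word of the query by its start position.
    word_positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        word_positions[query[i:i + ktup]].append(i)

    # Each shared word votes for the diagonal on which it lies.
    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in word_positions.get(subject[j:j + ktup], ()):
            hits[j - i] += 1

    if not hits:
        return None, 0
    offset = max(hits, key=hits.get)
    return offset, hits[offset]

print(best_diagonal("ACGTACGT", "TTACGTACGTTT", ktup=4))
```

A larger ktup makes the search faster and more selective, at the cost of missing regions whose matches are interrupted more often than every ktup bases.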

DNA homologies between two sequences can be examined
graphically using the Harr method of constructing dot
matrix homology plots (Needleman, S.B. and Wunsch, C.D., J.
Mol. Biol. 48:443 (1970)). This method produces a
two-dimensional plot which can be useful in determining
regions of homology versus regions of repetition.
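
A dot matrix plot of this kind can be sketched in a few lines of Python. The sketch below assumes the simplest rule, a dot wherever two windows of equal length are identical; the function names and the window length are hypothetical choices for illustration.

```python
def dot_matrix(seq_a, seq_b, window=3):
    """Return the set of (i, j) where seq_a[i:i+window] equals seq_b[j:j+window]."""
    return {
        (i, j)
        for i in range(len(seq_a) - window + 1)
        for j in range(len(seq_b) - window + 1)
        if seq_a[i:i + window] == seq_b[j:j + window]
    }

def render(seq_a, seq_b, dots, window=3):
    """Draw the matrix as text: one row per position in seq_a, '*' for a dot."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(seq_b) - window + 1))
        for i in range(len(seq_a) - window + 1)
    )

dots = dot_matrix("GATTACA", "GATTACA")
print(render("GATTACA", "GATTACA", dots))
```

A homologous region shows up as a single unbroken diagonal, whereas a repeated element produces several parallel diagonals, which is how the plot separates homology from repetition.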

However, in a class of preferred embodiments, step (a)
is implemented by processing the library data in the
commercially available computer program known as the
INHERIT 670 Sequence Analysis System, available from
Applied Biosystems Inc. (Foster City, California),
including the software known as the Factura software (also
available from Applied Biosystems Inc.). The Factura
program preprocesses each library sequence to "edit out"
portions thereof which are not likely to be of interest,
such as the vector used to prepare the library. Additional
sequences which can be edited out or masked (ignored by the
search tools) include but are not limited to the polyA tail
and repetitive GAG and CCC sequences. A low-end search
program can be written to mask out such "low-information"
sequences, or programs such as BLAST can ignore the
low-information sequences.

In the algorithm implemented by the INHERIT 670
Sequence Analysis System, the Pattern Specification
Language (developed by TRW Inc.) is used to determine
regions of homology. "There are three parameters that
determine how INHERIT analysis runs sequence comparisons:
window size, window offset and error tolerance. Window
size specifies the length of the segments into which the
query sequence is subdivided. Window offset specifies
where to start the next segment [to be compared], counting
from the beginning of the previous segment. Error
tolerance specifies the total number of insertions,
deletions and/or substitutions that are tolerated over the
specified word length. Error tolerance may be set to any
integer between 0 and 6. The default settings are window
[size]=20, window offset=10 and error tolerance=3."
INHERIT Analysis Users Manual, pp. 2-15, Version 1.0,
Applied Biosystems, Inc., October 1991.
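
How the window and offset parameters subdivide a query can be seen in a small Python sketch. It is not INHERIT or Pattern Specification Language code; the function names are hypothetical, the defaults follow the quoted settings (20 for the window, 10 for the offset, 3 for the error tolerance), and the tolerance check counts substitutions only, which is a deliberate simplification of the insertions, deletions and substitutions the manual describes.

```python
def split_into_windows(query, window_size=20, window_offset=10):
    """Return (start, segment) pairs; each segment begins window_offset
    bases after the beginning of the previous one."""
    return [
        (start, query[start:start + window_size])
        for start in range(0, len(query) - window_size + 1, window_offset)
    ]

def within_tolerance(segment, candidate, error_tolerance=3):
    """Simplified check: equal-length strings differing by at most
    error_tolerance substitutions (insertions and deletions are ignored)."""
    if len(segment) != len(candidate):
        return False
    mismatches = sum(a != b for a, b in zip(segment, candidate))
    return mismatches <= error_tolerance

for start, segment in split_into_windows("ACGT" * 12):
    print(start, segment)
```

With an offset smaller than the window, consecutive segments overlap, so a homologous region that straddles a segment boundary is still covered by the next segment.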

Using a combination of these three parameters, a
database (such as a DNA database) can be searched for
sequences containing regions of homology and the
appropriate sequences are scored with an initial value.
Subsequently, these homologous regions are examined using
dot matrix homology plots to determine regions of homology
versus regions of repetition. Smith-Waterman alignments
can be used to display the results of the homology search.
The INHERIT software can be executed by a Sun computer
system programmed with the UNIX operating system.

Search alternatives to INHERIT include the BLAST
program, GCG (available from the Genetics Computer Group,
WI) and the Dasher program (Temple Smith, Boston
University, Boston, MA). Nucleotide sequences can be
searched against Genbank, EMBL or custom databases such as
GENESEQ (available from Intelligenetics, Mountain View, CA)
or other databases for genes. In addition, we have
searched some sequences against our own in-house database.

In preferred embodiments, the transcript sequences are
analyzed by the INHERIT software for best conformance with
a reference gene transcript, to assign a sequence identifier
and a degree of homology. Together these constitute the
identified sequence value, which is input into, and further
processed by, a Macintosh personal computer (available from
Apple) programmed with an "abundance sort and subtraction
analysis" computer program (to be described below).

Prior to the abundance sort and subtraction analysis
program (also denoted as the "abundance sort" program),
identified sequences from the cDNA clones are assigned a
value (according to the parameters given above) by degree
of match according to the following categories: "exact"
matches (regions with a high degree of identity),
homologous human matches (regions of high similarity, but
not "exact" matches), homologous non-human matches (regions
of high similarity present in species other than human), or
non-matches (no significant regions of homology to
previously identified nucleotide sequences stored in the
form of the database). Alternatively, the degree of match
can be a numeric value as described below.

With reference again to the step of identifying
matches between reference sequences and database entries,
protein and peptide sequences can be deduced from the
nucleic acid sequences. Using the deduced polypeptide
sequence, the match identification can be performed in a
manner analogous to that done with cDNA sequences. A
protein sequence is used as a query sequence and compared
to the previously identified sequences contained in a
database such as the Swiss/Prot, PIR and the NBRF Protein
database to find homologous proteins. These proteins are
initially scored for homology using a homology score table
(Orcutt, B.C. and Dayhoff, M.O., Scoring Matrices, PIR
Report MAT-0285 (February 1985)), resulting in an INIT
score. The homologous regions are aligned to obtain the
highest matching scores by inserting a gap which adds a
probable deleted portion. The matching score is
recalculated using the homology score table and the
insertion score table, resulting in an optimized (OPT)
score. Even in the absence of knowledge of the proper
reading frame of an isolated sequence, the above-described
protein homology search may be performed by searching all 3
reading frames.
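
Deducing a protein query in all 3 reading frames can be sketched in Python as follows. The sketch assumes the standard genetic code and translates the given strand only; the function names are hypothetical, and an incomplete trailing codon is dropped.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0):
    """Translate one reading frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def three_frame_translation(dna):
    """Return the deduced peptide for each of the three reading frames."""
    return [translate(dna, frame) for frame in range(3)]

print(three_frame_translation("ATGGCCTGA"))
```

Each of the three deduced peptides is then used as a separate query, and the frame giving a significant match indicates the proper reading frame.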

Peptide and protein sequence homologies can also be
ascertained using the INHERIT 670 Sequence Analysis System
in a way analogous to that used for DNA sequence
homologies. Pattern Specification Language and parameter
windows are used to search protein databases for sequences
containing regions of homology, which are scored with an
initial value. Subsequent display in a dot-matrix homology
plot shows regions of homology versus regions of
repetition. Additional search tools that are available to
use on pattern search databases include PLsearch Blocks
(available from Henikoff & Henikoff, University of
Washington, Seattle), Dasher and GCG. Pattern search
databases include, but are not limited to, Protein Blocks
(available from Henikoff & Henikoff, University of
Washington, Seattle), Brookhaven Protein (available from
the Brookhaven National Laboratory, Upton, NY),
PROSITE (available from Amos Bairoch, University of Geneva,
Switzerland), ProDom (available from Temple Smith, Boston
University), and PROTEIN MOTIF FINGERPRINT (available from
University of Leeds, United Kingdom).

The ABI Assembler application software, part of the
INHERIT DNA analysis system (available from Applied
Biosystems, Inc., Foster City, CA), can be employed to
create and manage sequence assembly projects by assembling
data from selected sequence fragments into a larger
sequence. The Assembler software combines two advanced
computer technologies which maximize the ability to
assemble sequenced DNA fragments into Assemblages, a
special grouping of data where the relationships between
sequences are shown by graphic overlap, alignment and
statistical views. The process is based on the
Myers-Kececioglu model of fragment assembly (INHERIT™
Assembler User's Manual, Applied Biosystems, Inc., Foster
City, CA), and uses graph theory as the foundation of a
very rigorous multiple sequence alignment engine for
assembling DNA sequence fragments. Other assembly programs
that can be used include MEGALIGN (available from DNASTAR
Inc., Madison, WI), Dasher and STADEN (available from Roger
Staden, Cambridge, England).

Next, with reference to Fig. 2, we describe in more
detail the "abundance sort" program which implements
above-mentioned "step (b)" to tabulate the number of
sequences of the library which match each database entry
(the "abundance number" for each database entry).

Fig. 2 is a flow chart of a preferred embodiment of
the abundance sort program. A source code listing of this
embodiment of the abundance sort program is set forth in
Table 5. In the Table 5 implementation, the abundance sort
program is written using the FoxBASE programming language
commercially available from Microsoft Corporation.
Although FoxBASE was the program chosen for the first
iteration of this technology, it should not be considered
limiting. Many other programming languages, Sybase being a
particularly desirable alternative, can also be used, as
will be obvious to one with ordinary skill in the art. The
subroutine names specified in Fig. 2 correspond to
subroutines listed in Table 5.

With reference again to Fig. 2, the "Identified
Sequences" are transcript sequences representing each
sequence of the library and a corresponding identification
of the database entry (if any) which it matches. In other
words, the "Identified Sequences" are transcript sequences
representing the output of above-discussed "step (a)."

Fig. 3 is a block diagram of a system for implementing
the invention. The Fig. 3 system includes library
generation unit 2, which generates a library and asserts an
output stream of transcript sequences indicative of the
biological sequences comprising the library. Programmed
processor 4 receives the data stream output from unit 2 and
processes this data in accordance with above-discussed
"step (a)" to generate the Identified Sequences. Processor
4 can be a processor programmed with the commercially
available computer program known as the INHERIT 670
Sequence Analysis System and the commercially available
computer program known as the Factura program (both
available from Applied Biosystems Inc.) and with the UNIX
operating system.

Still with reference to Fig. 3, the Identified
Sequences are loaded into processor 6, which is programmed
with the abundance sort program. Processor 6 generates the
Final Transcript sequences indicated in both Figs. 2 and 3.
Fig. 4 shows a more detailed block diagram of a planned
relational computer system, including various searching
techniques which can be implemented, along with an
assortment of databases to query against.

With reference to Fig. 2, the abundance sort program
first performs an operation known as "Tempnum" on the
Identified Sequences, to discard all of the Identified
Sequences except those which match database entries of
selected types. For example, the Tempnum process can
select Identified Sequences which represent matches of the
following types with database entries (see above for
definition): "exact" matches, human "homologous" matches,
"other species" matches (representing genes present in
species other than human), "no" matches (no significant
regions of homology with database entries representing
previously identified nucleotide sequences), "I" matches
(Incyte, for not previously known DNA sequences), or "X"
matches (matches ESTs in reference database). This
eliminates the U, S, M, V, A, R and D sequences (see Table 1
for definitions).
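
The selection performed by "Tempnum" can be pictured as a filter on a match-type code. The Python sketch below is an illustration only, not the Table 5 subroutine; the record layout (a dictionary with `clone`, `match_code` and `entry` fields) and the example entries are hypothetical, while the excluded codes are those named above.

```python
# Match-type codes eliminated by Tempnum, as listed in the text.
EXCLUDED_CODES = frozenset({"U", "S", "M", "V", "A", "R", "D"})

def tempnum(identified_sequences, excluded=EXCLUDED_CODES):
    """Keep only the Identified Sequences whose match code is not excluded."""
    return [rec for rec in identified_sequences if rec["match_code"] not in excluded]

# Hypothetical Identified Sequences (field names are assumptions).
records = [
    {"clone": "c001", "match_code": "I", "entry": "INCYTE 000123"},
    {"clone": "c002", "match_code": "R", "entry": "rRNA"},
    {"clone": "c003", "match_code": "X", "entry": "EST 00456"},
    {"clone": "c004", "match_code": "U", "entry": None},
]
selected = tempnum(records)
print([rec["clone"] for rec in selected])
```

Passing a different `excluded` set gives a different selection of match types without changing the rest of the pipeline.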

The identified sequence values selected during the
"Tempnum" process then undergo a further selection (weeding
out) operation known as "Tempred." This operation can, for
example, discard all identified sequence values
representing matches with selected database entries.

The identified sequence values selected during the
"Tempred" process are then classified according to library,
during the "Tempdesig" operation. It is contemplated that
the "Identified Sequences" can represent sequences from a
single library, or from two or more libraries.

Consider first the case that the identified sequence
values represent sequences from a single library. In this
case, all the identified sequence values determined during
"Tempred" undergo sorting in the "Templib" operation,
further sorting in the "Libsort" operation, and finally
additional sorting in the "Temptarsort" operation. For
example, these three sorting operations can sort the
identified sequences in order of decreasing "abundance
number" (to generate a list of decreasing abundance
numbers, each abundance number corresponding to a unique
identified sequence entry, or several lists of decreasing
abundance numbers, with the abundance numbers in each list
corresponding to database entries of a selected type) with
redundancies eliminated from each sorted list. In this
case, the operation identified as "Cruncher" can be
bypassed, so that the "Final Data" values are the organized
transcript sequences produced during the "Temptarsort"
operation.

We next consider the case that the transcript
sequences produced during the "Tempred" operation represent
sequences from two libraries (which we will denote the
"target" library and the "subtractant" library). For
example, the target library may consist of cDNA sequences
from clones of a diseased cell, while the subtractant
library may consist of cDNA sequences from clones of the
diseased cell after treatment by exposure to a drug. For
another example, the target library may consist of cDNA
sequences from clones of a cell type from a young human,
while the subtractant library may consist of cDNA sequences
from clones of the same cell type from the same human at
different ages.

In this case, the "Tempdesig" operation routes all
transcript sequences representing the target library for
processing in accordance with "Templib" (and then "Libsort"
and "Temptarsort"), and routes all transcript sequences
representing the subtractant library for processing in
accordance with "Tempsub" (and then "Subsort" and
"Tempsubsort"). For example, the consecutive "Templib,"
"Libsort," and "Temptarsort" sorting operations sort
identified sequences from the target library in order of
decreasing abundance number (to generate a list of
decreasing abundance numbers, each abundance number
corresponding to a database entry, or several lists of
decreasing abundance numbers, with the abundance numbers in
each list corresponding to database entries of a selected
type) with redundancies eliminated from each sorted list.
The consecutive "Tempsub," "Subsort," and "Tempsubsort"
sorting operations sort identified sequences from the
subtractant library in order of decreasing abundance number
(to generate a list of decreasing abundance numbers, each
abundance number corresponding to a database entry, or
several lists of decreasing abundance numbers, with the
abundance numbers in each list corresponding to database
entries of a selected type) with redundancies eliminated
from each sorted list.

The transcript sequences output from the "Temptarsort"
operation typically represent sorted lists from which a
histogram could be generated in which position along one
(e.g., horizontal) axis indicates abundance number (of
target library sequences), and position along another
(e.g., vertical) axis indicates identified sequence value
(e.g., human or non-human gene type). Similarly, the
transcript sequences output from the "Tempsubsort"
operation typically represent sorted lists from which a
histogram could be generated in which position along one
(e.g., horizontal) axis indicates abundance number (of
subtractant library sequences), and position along another
(e.g., vertical) axis indicates identified sequence value
(e.g., human or non-human gene type).

The transcript sequences (sorted lists) output from
the Tempsubsort and Temptarsort sorting operations are
combined during the operation identified as "Cruncher."
The "Cruncher" process identifies pairs of corresponding
target and subtractant abundance numbers (both representing
the same identified sequence value), and divides one by the
other to generate a "ratio" value for each pair of
corresponding abundance numbers, and then sorts the ratio
values in order of decreasing ratio value. The data output
from the "Cruncher" operation (the "Final Transcript
sequences" in Fig. 2) is typically a sorted list from which a
histogram could be generated in which position along one
axis indicates the size of a ratio of abundance numbers
(for corresponding identified sequence values from target
and subtractant libraries) and position along another axis
indicates identified sequence value (e.g., gene type).

Preferably, prior to obtaining a ratio between the two
library abundance values, the Cruncher operation also
divides each abundance value by the total number of
sequences in its library (for one or both of the target and
subtractant libraries). The resulting lists of "relative"
ratio values generated by the Cruncher operation are useful
for many medical, scientific, and industrial applications.
Also preferably, the output of the Cruncher operation is a
set of lists, each list representing a sequence of
decreasing ratio values for a different selected subset
(e.g., protein family) of database entries.

In one example, the abundance sort program of the
invention tabulates for a library the numbers of mRNA
transcripts corresponding to each gene identified in a
database. These numbers are divided by the total number of
clones sampled. The results of the division reflect the
relative abundance of the mRNA transcripts in the cell type
or tissue from which they were obtained. Obtaining this
final data set is referred to herein as "gene transcript
image analysis." The resulting subtracted data show
exactly what proteins and genes are upregulated and
downregulated in highly detailed complexity.

6.6. HUVEC cDNA LIBRARY

Table 2 is an abundance table listing the various gene
transcripts in an induced HUVEC library. The transcripts
are listed in order of decreasing abundance. This
computerized sorting simplifies analysis of the tissue and
speeds identification of significant new proteins which are
specific to this cell type. This type of endothelial cell
lines tissues of the cardiovascular system, and the more
that is known about its composition, particularly in
response to activation, the more choices of protein targets
become available to affect in treating disorders of this
tissue, such as the highly prevalent atherosclerosis.

6.7. MONOCYTE-CELL AND MAST-CELL cDNA LIBRARIES

Tables 3 and 4 show truncated comparisons of two
libraries. In Tables 3 and 4 the "normal monocytes" are
the HMC-1 cells, and the "activated macrophages" are the
THP-1 cells pretreated with PMA and activated with LPS.
Table 3 lists in descending order of abundance the most
abundant gene transcripts for both cell types. With only
15 gene transcripts from each cell type, this table permits
quick, qualitative comparison of the most common
transcripts. This abundance sort, with its convenient
side-by-side display, provides an immediately useful
research tool. In this example, this research tool
discloses that 1) only one of the top 15 activated
macrophage transcripts is found in the top 15 normal
monocyte gene transcripts (poly A binding protein); and 2)
a new gene transcript (previously unreported in other
databases) is relatively highly represented in activated
macrophages but is not similarly prominent in normal
macrophages. Such a research tool provides researchers
with a short-cut to new proteins, such as receptors,
cell-surface and intracellular signalling molecules, which
can serve as drug targets in commercial drug screening
programs. Such a tool could save considerable time over
that consumed by a hit and miss discovery program aimed at
identifying important proteins in and around cells, because
those proteins carrying out everyday cellular functions and
represented as steady state mRNA are quickly eliminated
from further characterization.

This illustrates how the gene transcript profiles
change with altered cellular function. Those skilled in
the art know that the biochemical composition of cells also
changes with other functional changes such as cancer,
including cancer's various stages, and exposure to
toxicity. A gene transcript subtraction profile such as in
Table 3 is useful as a first screening tool for such gene
expression and protein studies.

6.8. SUBTRACTION ANALYSIS OF NORMAL MONOCYTE-CELL AND
ACTIVATED MONOCYTE CELL cDNA LIBRARIES

Once the cDNA data were in the computer, the computer
program as disclosed in Table 5 was used to obtain ratios
of all the gene transcripts in the two libraries discussed
in Example 6.7, and the gene transcripts were sorted by the
descending values of their ratios. If a gene transcript is
not represented in one library, that gene transcript's
abundance is unknown but appears to be less than 1. As an
approximation (and to obtain a ratio, which would not be
possible if the unrepresented gene were given an abundance
of zero), genes which are represented in only one of the
two libraries are assigned an abundance of 1/2. Using 1/2
for unrepresented clones increases the relative importance
of "turned-on" and "turned-off" genes, whose products would
be drug candidates. The resulting print-out is called a
subtraction table and is an extremely valuable screening
method, as is shown by the following data.
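
The 1/2 substitution and the descending sort can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the Table 5 program: each library is taken to be a table of abundance numbers keyed by gene transcript, and the function name and example transcripts are hypothetical.

```python
def subtraction_table(target, subtractant, missing=0.5):
    """Build rows (gene, target abundance, subtractant abundance, ratio),
    sorted by descending ratio. A gene absent from one library is given
    the abundance `missing` (1/2) so that a ratio can still be formed."""
    rows = []
    for gene in set(target) | set(subtractant):
        t = target.get(gene, 0) or missing
        s = subtractant.get(gene, 0) or missing
        rows.append((gene, t, s, t / s))
    # Descending ratio; ties broken alphabetically for a stable print-out.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows

# Hypothetical abundance numbers.
activated = {"IL-1 beta": 4, "actin": 1}
normal = {"actin": 2, "ferritin": 3}
for gene, t, s, ratio in subtraction_table(activated, normal):
    print(f"{gene}\t{t}\t{s}\t{ratio:.2f}")
```

Here the transcript seen 4 times in the activated library and never in the normal library receives a ratio of 8, which places "turned-on" genes at the top of the table and "turned-off" genes at the bottom.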

Table 4 is a subtraction table, in which the normal
monocyte library was electronically "subtracted" from the
activated macrophage library. This table highlights most
effectively the changes in abundance of the gene
transcripts by activation of macrophages. Even among the
first 20 gene transcripts listed, there are several unknown
gene transcripts. Thus, electronic subtraction is a useful
tool with which to assist researchers in identifying much
more quickly the basic biochemical changes between two cell
types. Such a tool can save universities and
pharmaceutical companies, which spend billions of dollars on
research, valuable time and laboratory resources at the
early discovery stage and can speed up the drug development
cycle, which in turn permits researchers to set up drug
screening programs much earlier. Thus, this research tool
provides a way to get new drugs to the public faster and
more economically.

Also, such a subtraction table can be obtained for
patient diagnosis. An individual patient sample (such as
monocytes obtained from a biopsy or blood sample) can be
compared with data provided herein to diagnose conditions
associated with macrophage activation.

Table 4 uncovered many new gene transcripts (labeled
Incyte clones). Note that many genes are turned on in the
activated macrophage (i.e., the monocyte had a 0 in the
bgfreq column). This screening method is superior to other
screening techniques, such as the western blot, which are
incapable of uncovering such a multitude of discrete new
gene transcripts.

The subtraction-screening technique has also uncovered
a high number of cancer gene transcripts (oncogenes rho,
ETS2, rab-2 ras, YPT1-related, and acute myeloid leukemia
mRNA) in the activated macrophage. These transcripts may
be attributed to the use of immortalized cell lines and are
inherently interesting for that reason. This screening
technique offers a detailed picture of upregulated
transcripts including oncogenes, which helps explain why
anti-cancer drugs interfere with the patient's immunity
mediated by activated macrophages. Armed with knowledge
gained from this screening method, those skilled in the art
can set up more targeted, more effective drug screening
programs to identify drugs which are differentially
effective against 1) both relevant cancers and activated
macrophage conditions with the same gene transcript
profile; 2) cancer alone; and 3) activated macrophage
conditions.

Smooth muscle senescent protein (22 kd) was
upregulated in the activated macrophage, which indicates
that it is a candidate to block in controlling
inflammation.

6.9. SUBTRACTION ANALYSIS OF NORMAL LIVER CELLS AND
HEPATITIS INFECTED LIVER CELL cDNA LIBRARIES

In this example, rats are exposed to hepatitis virus
and maintained in the colony until they show definite signs
of hepatitis. Of the rats diagnosed with hepatitis, one
half of the rats are treated with a new anti-hepatitis
agent (AHA). Liver samples are obtained from all rats
before exposure to the hepatitis virus and at the end of
AHA treatment or no treatment. In addition, liver samples
can be obtained from rats with hepatitis just prior to AHA
treatment.

The liver tissue is treated as described in Examples
6.2 and 6.3 to obtain mRNA and subsequently to sequence
cDNA. The cDNA from each sample are processed and analyzed
for abundance according to the computer program in Table 5.
The resulting gene transcript images of the cDNA provide
detailed pictures of the baseline (control) for each animal
and of the infected and/or treated state of the animals.
cDNA data for a group of samples can be combined into a
group summary gene transcript profile for all control
samples, all samples from infected rats and all samples
from AHA-treated rats.

Subtractions are performed between appropriate 
individual libraries and the grouped libraries. For 
individual animals, control and post-study samples can be 
subtracted. Also, if samples are obtained before and after 
AHA treatment, the data from individual animals and 
treatment groups can be subtracted. In addition, the data 
for all control samples can be pooled and averaged. The 
control average can be subtracted from averages of both 
post-study AHA and post-study non-AHA cDNA samples. If 
pre- and post-treatment samples are available, pre- and 
post-treatment samples can be compared individually (or 
electronically averaged) and subtracted. 
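
The pooling and subtraction described above can be sketched as follows. This is a minimal illustration, not the program of Table 5: the function names, the gene names, and the choice of a simple per-gene difference as the "subtraction" are all assumptions made for the example.

```python
from collections import Counter


def average_profile(profiles):
    """Average per-gene clone counts over a group of samples."""
    totals = Counter()
    for profile in profiles:
        totals.update(profile)  # adds counts gene by gene
    return {gene: count / len(profiles) for gene, count in totals.items()}


def subtract(target, background):
    """Per-gene difference; a gene missing from a profile counts as 0."""
    genes = set(target) | set(background)
    return {gene: target.get(gene, 0) - background.get(gene, 0) for gene in genes}


# Hypothetical clone counts for two control and two AHA-treated animals.
controls = [{"albumin": 10, "gene_x": 0}, {"albumin": 12, "gene_x": 2}]
treated = [{"albumin": 6, "gene_x": 9}, {"albumin": 8, "gene_x": 7}]
difference = subtract(average_profile(treated), average_profile(controls))
```

A positive entry in `difference` marks a transcript that is more abundant in the treated group than in the pooled control average; a negative entry marks one that is less abundant.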

These subtraction tables are used in two general ways. 
First, the differences are analyzed for gene transcripts 
which are associated with continuing hepatic deterioration 
or healing. The subtraction tables are tools to isolate 
the effects of the drug treatment from the underlying basic 
pathology of hepatitis. Because hepatitis affects many 



WO 95/20681 PCT/US95/01160 

parameters, additional liver toxicity has been difficult to 
detect with only blood tests for the usual enzymes. The 
gene transcript profile and subtraction provide a much 
more complex biochemical picture which researchers have 
needed to analyze such difficult problems. 

Second, the subtraction tables provide a tool for 
identifying clinical markers, individual proteins or other 
biochemical determinants which are used to predict and/or 
evaluate a clinical endpoint, such as disease, improvement 
due to the drug, and even additional pathology due to the 
drug. The subtraction tables specifically highlight genes 
which are turned on or off. Thus, the subtraction tables 
provide a first screen for a set of gene transcript 
candidates for use as clinical markers. Subsequently, 
electronic subtractions of additional cell and tissue 
libraries reveal which of the potential markers are in fact 
found in different cell and tissue libraries. Candidate 
gene transcripts found in additional libraries are removed 
from the set of potential clinical markers. Then, tests of 
blood or other relevant samples which are known to lack and 
to have the relevant condition are compared to validate the 
selection of the clinical marker. In this method, the 
particular physiologic function of the protein transcript 
need not be determined to qualify the gene transcript as a 
clinical marker. 
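
The second-stage screen described above, in which candidates that also appear in other cell and tissue libraries are dropped, might look like this in outline. The function name, library names, and gene names are hypothetical, and each library is reduced to a plain set of gene identifiers for the sake of the example.

```python
def screen_markers(candidates, other_libraries):
    """Drop candidate transcripts that are also found in any other library."""
    seen_elsewhere = set()
    for genes in other_libraries.values():
        seen_elsewhere.update(genes)
    return sorted(gene for gene in candidates if gene not in seen_elsewhere)


# Hypothetical first-screen candidates and two additional tissue libraries.
candidates = {"gene_a", "gene_b", "gene_c"}
other_libraries = {
    "kidney": {"gene_b", "gene_z"},
    "lung": {"gene_c"},
}
markers = screen_markers(candidates, other_libraries)
```

Only transcripts absent from every additional library survive; validation against samples with and without the condition would still follow, as the text describes.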

6.10. ELECTRONIC NORTHERN BLOT 

One limitation of electronic subtraction is that it is 
difficult to compare more than a pair of images at once. 
Once particular individual gene products are identified as 
relevant to further study (via electronic subtraction or 
other methods), it is useful to study the expression of 
single genes in a multitude of different tissues. In the 
lab, the technique of "Northern" blot hybridization is used 
for this purpose. In this technique, a single cDNA, or a 
probe corresponding thereto, is labeled and then hybridized 
against a blot containing RNA samples prepared from a 
multitude of tissues or cell types. Upon autoradiography, 
second set is indicative of one of the biological sequences 
of the second library. Then the second set of transcript 
sequences is processed in a programmed computer to generate 
a second set of identified sequence values, namely the 
further identified sequence values, each of which is 
indicative of a sequence annotation and includes a degree 
of match between one of the biological sequences of the 
second library and at least one of the reference sequences. 
The further identified sequence values are processed to 
generate further final data values indicative of the number 
of times each further identified sequence value is present 
in the second library. The final data values from the 
first specimen and the further identified sequence values 
from the second specimen are processed to generate ratios 
of transcript sequences, which indicate the differences in 
the number of gene transcripts between the two specimens. 
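
As a rough sketch of the ratio step, assume each library has already been reduced to a mapping from identified sequence to clone count. The function name and the gene names are hypothetical. The substitute value of 0.5 for a sequence absent from the second library is an assumption of this example, chosen only to avoid division by zero.

```python
def abundance_ratios(first, second, absent=0.5):
    """Ratio of first-library count to second-library count for each sequence.

    A sequence with no clones in the second library is divided by `absent`
    instead of zero (an assumption of this sketch).
    """
    return {gene: count / (second.get(gene) or absent) for gene, count in first.items()}


# Hypothetical final data values for two specimens.
first_library = {"g1": 8, "g2": 3}
second_library = {"g1": 2}
ratios = abundance_ratios(first_library, second_library)
```

Sequences with large ratios are enriched in the first specimen relative to the second.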

In a further embodiment, the method includes 
quantifying the relative abundance of mRNA in a biological 
specimen by (a) isolating a population of mRNA transcripts 
from a biological specimen; (b) identifying genes from 
which the mRNA was transcribed by a sequence-specific 
method; (c) determining the numbers of mRNA transcripts 
corresponding to each of the genes; and (d) using the mRNA 
transcript numbers to determine the relative abundance of 
mRNA transcripts within the population of mRNA transcripts. 

Also disclosed is a method of producing a gene 
transcript image analysis by first obtaining a mixture of 
mRNA, from which cDNA copies are made. The cDNA is 
inserted into a suitable vector which is used to transfect 
suitable host strain cells which are plated out and 
permitted to grow into clones, each clone representing a 
unique mRNA. A representative population of clones 
transfected with cDNA is isolated. Each clone in the 
population is identified by a sequence-specific method 
which identifies the gene from which the unique mRNA was 
transcribed. The number of times each gene is identified 
to a clone is determined to evaluate gene transcript 
abundance. The genes and their abundances are listed in 
order of abundance to produce a gene transcript image. 
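
The final tabulation step can be illustrated with a short sketch. It assumes each sequenced clone has already been assigned a gene identifier; the function name and the gene names are hypothetical.

```python
from collections import Counter


def gene_transcript_image(clone_genes):
    """Count clones per gene and list genes from most to least abundant."""
    counts = Counter(clone_genes)
    # Ties are broken alphabetically so the listing is reproducible.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical gene assignments for six sequenced clones.
clones = ["EF1A", "ACTB", "EF1A", "FTL", "EF1A", "ACTB"]
image = gene_transcript_image(clones)
```

Dividing each count by the total number of clones would give the relative abundance of each transcript in the population.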

the pattern of expression of that particular gene, one at a 
time, can be quantitated in all the included samples. 

In contrast, a further embodiment of this invention is 
the computerized form of this process, termed here 
"electronic northern blot." In this variation, a single 
gene is queried for expression against a multitude of 
prepared and sequenced libraries present within the 
database. In this way, the pattern of expression of any 
single candidate gene can be examined instantaneously and 
effortlessly. More candidate genes can thus be scanned, 
leading to more frequent and fruitfully relevant 
discoveries. The computer program included as Table 5 
includes a program for performing this function, and Table 
6 is a partial listing of entries of the database used in 
the electronic northern blot analysis. 
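
In outline, the query amounts to counting the clones that match one gene within each library. The sketch below is an illustration only and is not the routine of Table 5: the database is reduced to a list of (library, gene) pairs, and the function name and gene names are hypothetical.

```python
from collections import Counter


def electronic_northern(clones, gene):
    """Count clones matching `gene` in every library of the database."""
    hits = Counter(library for library, entry in clones if entry == gene)
    libraries = sorted({library for library, _ in clones})
    # Libraries with no matching clone are reported with a count of 0.
    return [(library, hits.get(library, 0)) for library in libraries]


# Hypothetical clone records: (library, identified gene).
clones = [
    ("HUVEC", "IL8"),
    ("HUVEC", "ACTB"),
    ("THP-1", "IL8"),
    ("THP-1", "IL8"),
    ("HMC", "ACTB"),
]
blot = electronic_northern(clones, "IL8")
```

Each row of `blot` plays the role of one lane on a conventional Northern blot, with the clone count standing in for band intensity.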

6.11. PHASE I CLINICAL TRIALS 

Based on the establishment of safety and effectiveness 
in the above animal tests, Phase I clinical tests are 
undertaken. Normal patients are subjected to the usual 
preliminary clinical laboratory tests. In addition, 
appropriate specimens are taken and subjected to gene 
transcript analysis. Additional patient specimens are 
taken at predetermined intervals during the test. The 
specimens are subjected to gene transcript analysis as 
described above. In addition, the gene transcript changes 
noted in the earlier rat toxicity study are carefully 
evaluated as clinical markers in the followed patients. 
Changes in the gene transcript analyses are evaluated as 
indicators of toxicity by correlation with clinical signs 
and symptoms and other laboratory results. In addition, 
subtraction is performed on individual patient specimens 
and on averaged patient specimens. The subtraction 
analysis highlights any toxicological changes in the 
treated patients. This is a highly refined determinant of 
toxicity. The subtraction method also annotates clinical 
markers. Further subgroups can be analyzed by subtraction 
analysis, including, for example, 1) segregation by 
occurrence and type of adverse effect; and 2) segregation 
by dosage. 

6.12. GENE TRANSCRIPT IMAGING ANALYSIS IN CLINICAL STUDIES 

A gene transcript imaging analysis (or multiple gene 
transcript imaging analyses) is a useful tool in other 
clinical studies. For example, the differences in gene 
transcript imaging analyses before and after treatment can 
be assessed for patients on placebo and drug treatment. 
This method also effectively screens for clinical markers 
to follow in clinical use of the drug. 

6.13. COMPARATIVE GENE TRANSCRIPT ANALYSIS BETWEEN SPECIES 

The subtraction method can be used to screen cDNA 
libraries from diverse sources. For example, the same cell 
types from different species can be compared by gene 
transcript analysis to screen for specific differences, 
such as in detoxification enzyme systems. Such testing 
aids in the selection and validation of an animal model for 
the commercial purpose of drug screening or toxicological 
testing of drugs intended for human or animal use. When 
the comparison between animals of different species is 
shown in columns for each species, we refer to this as an 
interspecies comparison, or zoo blot. 

Embodiments of this invention may employ databases 
such as those written using the FoxBASE programming 
language commercially available from Microsoft Corporation. 
Other embodiments of the invention employ other databases, 
such as a random peptide database, a polymer database, a 
synthetic oligomer database, or an oligonucleotide database 
of the type described in U.S. Patent 5,270,170, issued 
December 14, 1993 to Cull, et al., PCT International 
Application Publication No. WO 9322684, published November 
11, 1993, PCT International Application Publication No. WO 
9306121, published April 1, 1993, or PCT International 
Application Publication No. WO 9119818, published December 
26, 1991. These four references (whose text is 
incorporated herein by reference) include teaching which 
may be applied in implementing such other embodiments of 
the present invention. 

All references referred to in the preceding text are 
hereby expressly incorporated by reference herein. 
Various modifications and variations of the described 
method and system of the invention will be apparent to 
those skilled in the art without departing from the scope 
and spirit of the invention. Although the invention has 
been described in connection with specific preferred 
embodiments, it should be understood that the invention as 
claimed should not be unduly limited to such specific 
embodiments. 
TABLE 2 



Clone numbers 15000 through 20000 

Libraries: HUVEC 

Arranged by ABUNDANCE 

Total clones analyzed: 5000 

319 genes, for a total of 1713 Clones 
TABLE 4 



Libraries: THP-1 

Subtracting: HMC 

Sorted by ABUNDANCE 

Total clones analyzed: 7375 



1057 genes, for a total of 2151 clones 
TABLE 4 (Cont'd) 
TABLE 5 



* Master menu for SUBTRACTION output 
SET TALK OFF 
SET SAFETY OFF 
SET EXACT ON 
SET TYPEAHEAD TO 0 
CLEAR 
SET DEVICE TO SCREEN 
USE "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf" 
GO TOP 
STORE NUMBER TO INITIATE 
GO BOTTOM 
STORE NUMBER TO TERMINATE 
STORE ' ' TO Target1 
STORE ' ' TO Target2 
STORE ' ' TO Target3 
STORE ' ' TO Object1 
STORE ' ' TO Object2 
STORE ' ' TO Object3 
STORE 0 TO ANAL 
STORE 0 TO EMATCH 
STORE 0 TO HMATCH 
STORE 0 TO OMATCH 
STORE 0 TO IMATCH 
STORE 0 TO PTF 
STORE 1 TO BAIL 
DO WHILE .T. 
* Program.: Subtraction 2.fmt 
* Date....: 10/11/94 
* Version.: FoxBASE+/Mac, revision 1.10 
* Notes...: Format file Subtraction 2 
* 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",9 COLOR 0,0,0, 
@ PIXELS 75,120 TO 178,241 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1 
@ PIXELS 27,134 SAY "Subtraction Menu" STYLE 65536 FONT "Geneva",274 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 117,126 GET EMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Exact" SIZE 15,62 CO 
@ PIXELS 135,126 GET HMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Homologous" SIZE 15,1 
@ PIXELS 153,126 GET OMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Other spc" SIZE 15,84 
@ PIXELS 90,152 SAY "Matches:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 171,126 GET Imatch STYLE 65536 FONT "Chicago",12 PICTURE "@*C Incyte" SIZE 15,65 CO 
@ PIXELS 252,137 GET initiate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 252,236 GET terminate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 252,35 SAY "Include clones" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 252,215 SAY "->" STYLE 65536 FONT "Geneva",14 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 198,126 GET PTF STYLE 65536 FONT "Chicago",12 PICTURE "@*C Print to file" SIZE 15,9 
@ PIXELS 90,9 TO 181,109 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1 
@ PIXELS 90,288 TO 181,397 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1 
@ PIXELS 81,296 SAY "Background:" STYLE 65536 FONT "Geneva",270 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 45,135 GET ANAL STYLE 65536 FONT "Chicago",12 PICTURE "@*RH Overall;Function" SIZE 4 
@ PIXELS 81,56 SAY "Target:" STYLE 65536 FONT "Geneva",270 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 108,20 GET target1 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 135,20 GET target2 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 162,20 GET target3 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 108,299 GET object1 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 135,299 GET object2 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 162,299 GET object3 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 276,324 GET Bail STYLE 65536 FONT "Chicago",12 PICTURE "@*R Run;Bail out" SIZE 4112 

* 
* EOF: Subtraction 2.fmt 
READ 
IF Bail=2 
CLEAR 
CLOSE DATABASES 
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf" 
SET SAFETY ON 
SCREEN 1 OFF 
RETURN 
ENDIF 

STORE VAL(SYS(2)) TO STARTIME 
STORE UPPER(Target1) TO Target1 
STORE UPPER(Target2) TO Target2 
STORE UPPER(Target3) TO Target3 
STORE UPPER(Object1) TO Object1 
STORE UPPER(Object2) TO Object2 
STORE UPPER(Object3) TO Object3 
CLEAR 
SET TALK ON 
GAP = TERMINATE-INITIATE+1 
GO INITIATE 
COPY NEXT GAP FIELDS NUMBER,library,D,F,Z,R,ENTRY,S,DESCRIPTOR,START,RFEND,I TO TEMPNUM 
USE TEMPNUM 
COUNT TO TOT 
COPY TO TEMPRED FOR D='E'.OR.D='O'.OR.D='H'.OR.D='N'.OR.D='I' 
USE TEMPRED 
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0 
COPY TO TEMPDESIG 
ELSE 
COPY STRUCTURE TO TEMPDESIG 
USE TEMPDESIG 
IF Ematch=1 
APPEND FROM TEMPNUM FOR D='E' 
ENDIF 
IF Hmatch=1 
APPEND FROM TEMPNUM FOR D='H' 
ENDIF 
IF Omatch=1 
APPEND FROM TEMPNUM FOR D='O' 
ENDIF 
IF Imatch=1 
APPEND FROM TEMPNUM FOR D='I'.OR.D='X'.OR.D='M' 
ENDIF 
ENDIF 
COUNT TO STARTOT 
COPY STRUCTURE TO TEMPLIB 
USE TEMPLIB 
APPEND FROM TEMPDESIG FOR library=UPPER(target1) 
IF target2<>' ' 
APPEND FROM TEMPDESIG FOR library=UPPER(target2) 
ENDIF 
IF target3<>' ' 
APPEND FROM TEMPDESIG FOR library=UPPER(target3) 
ENDIF 
COUNT TO ANALTOT 
USE TEMPDESIG 
COPY STRUCTURE TO TEMPSUB 
USE TEMPSUB 
APPEND FROM TEMPDESIG FOR library=UPPER(Object1) 
IF target2<>' ' 
APPEND FROM TEMPDESIG FOR library=UPPER(Object2) 
ENDIF 
IF target3<>' ' 
APPEND FROM TEMPDESIG FOR library=UPPER(Object3) 
ENDIF 
COUNT TO SUBTRACTOT 
SET TALK OFF 



* COMPRESSION SUBROUTINE A 
? 'COMPRESSING QUERY LIBRARY' 
USE TEMPLIB 
SORT ON ENTRY,NUMBER TO LIBSORT 
USE LIBSORT 
COUNT TO IDGENE 
REPLACE ALL RFEND WITH 1 
MARK1 = 1 
SW2=0 
DO WHILE SW2=0 ROLL 
IF MARK1 >= IDGENE 
PACK 
COUNT TO AUNIQUE 
SW2=1 
LOOP 
ENDIF 
GO MARK1 
DUP = 1 
STORE ENTRY TO TESTA 
STORE D TO DESIGA 
SW = 0 
DO WHILE SW=0 TEST 
SKIP 
STORE ENTRY TO TESTB 
STORE D TO DESIGB 
IF TESTA = TESTB .AND. DESIGA=DESIGB 
DELETE 
DUP = DUP+1 
LOOP 
ENDIF 
GO MARK1 
REPLACE RFEND WITH DUP 
MARK1 = MARK1+DUP 
SW=1 
LOOP 
ENDDO TEST 
LOOP 
ENDDO ROLL 
SORT ON RFEND/D,NUMBER TO TEMPTARSORT 
USE TEMPTARSORT 
*REPLACE ALL START WITH RFEND/IDGENE*10000 
COUNT TO TEMPTARCO 
********************************************************************** 

* COMPRESSION SUBROUTINE B 
? 'COMPRESSING TARGET LIBRARY' 
USE TEMPSUB 
SORT ON ENTRY,NUMBER TO SUBSORT 
USE SUBSORT 
COUNT TO SUBGENE 
REPLACE ALL RFEND WITH 1 
MARK1 = 1 
SW2=0 
DO WHILE SW2=0 ROLL 
IF MARK1 >= SUBGENE 
PACK 
COUNT TO BUNIQUE 
SW2=1 
LOOP 
ENDIF 
GO MARK1 
DUP = 1 
STORE ENTRY TO TESTA 
STORE D TO DESIGA 
SW = 0 
DO WHILE SW=0 TEST 
SKIP 
STORE ENTRY TO TESTB 
STORE D TO DESIGB 
IF TESTA = TESTB .AND. DESIGA=DESIGB 
DELETE 
DUP = DUP+1 
LOOP 
ENDIF 
GO MARK1 
REPLACE RFEND WITH DUP 
MARK1 = MARK1+DUP 
SW=1 
LOOP 
ENDDO TEST 
LOOP 
ENDDO ROLL 
SORT ON RFEND/D,NUMBER TO TEMPSUBSORT 
USE TEMPSUBSORT 
*REPLACE ALL START WITH RFEND/IDGENE*10000 
COUNT TO TEMPSUBCO 



* FUSION ROUTINE 
? 'SUBTRACTING LIBRARIES' 
USE SUBTRACTION 
COPY STRUCTURE TO CRUNCHER 
SELECT 2 
USE TEMPSUBSORT 
SELECT 1 
USE CRUNCHER 
APPEND FROM TEMPTARSORT 
COUNT TO BAILOUT 
MARK = 0 
DO WHILE .T. 
SELECT 1 
MARK = MARK+1 
IF MARK>BAILOUT 
EXIT 
ENDIF 
GO MARK 
STORE ENTRY TO SCANNER 
SELECT 2 
LOCATE FOR ENTRY=SCANNER 
IF FOUND() 
STORE RFEND TO BIT1 
STORE RFEND TO BIT2 
ELSE 
STORE 1/2 TO BIT1 
STORE 0 TO BIT2 
ENDIF 
SELECT 1 
REPLACE BGFREQ WITH BIT2 
REPLACE ACTUAL WITH BIT1 
LOOP 
ENDDO 
SELECT 1 
REPLACE ALL RATIO WITH RFEND/ACTUAL 
? 'DOING FINAL SORT BY RATIO' 
SORT ON RATIO/D,BGFREQ/D,DESCRIPTOR TO FINAL 
USE FINAL 



SET TALK OFF 
DO CASE 
CASE PTF=0 
SET DEVICE TO PRINT 
SET PRINT ON 
EJECT 
CASE PTF=1 
SET ALTERNATE TO "Adenoid:Patent Figures:Subtraction.txt" 
SET ALTERNATE ON 
ENDCASE 
STORE VAL(SYS(2)) TO FINTIME 
IF FINTIME<STARTIME 
STORE FINTIME+86400 TO FINTIME 
ENDIF 
STORE FINTIME - STARTIME TO COMPSEC 
STORE COMPSEC/60 TO COMPMIN 
SET MARGIN TO 10 
@ 1,1 SAY "Library Subtraction Analysis" STYLE 65536 FONT "Geneva",274 COLOR 0,0,0,-1,-1 

? 
? 
? 
? 
? DATE() 
?? TIME() 
? 'Clone numbers ' 
?? STR(INITIATE,5,0) 
?? ' through ' 
?? STR(TERMINATE,6,0) 
? 'Libraries: ' 
?? Target1 
IF Target2<>' ' 
?? Target2 
ENDIF 
IF Target3<>' ' 
?? ', ' 
?? Target3 
ENDIF 
? 'Subtracting: ' 
?? Object1 
IF Object2<>' ' 
?? ', ' 
?? Object2 
ENDIF 
IF Object3<>' ' 
?? ', ' 
?? Object3 
ENDIF 
? 'Designations: ' 
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0 
?? 'All' 
ENDIF 
IF Ematch=1 
?? 'Exact,' 
ENDIF 
IF Hmatch=1 
?? 'Human, ' 
ENDIF 
IF Omatch=1 
?? 'Other sp. ' 
ENDIF 
IF Imatch=1 
?? 'Incyte' 
ENDIF 
IF ANAL=1 
? 'Sorted by ABUNDANCE' 
ENDIF 
IF ANAL=2 
? 'Arranged by FUNCTION' 
ENDIF 
? 'Total clones represented: ' 
?? STR(TOT,5,0) 
? 'Total clones analyzed: ' 
?? STR(STARTOT,5,0) 
? 'Total computation time: ' 
?? STR(COMPMIN,5,2) 
?? ' minutes' 
? 'd = designation f = distribution z = location r = function s = species i = inte 



SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",9 COLOR 0,0,0, 
DO CASE 
CASE ANAL=1 
?? STR(AUNIQUE,4,0) 
?? ' genes, for a total of ' 
?? STR(ANALTOT,4,0) 
?? ' clones' 
? 
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0, 
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I 
SET PRINT OFF 
CLOSE DATABASES 
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf" 

CA6E.aNAL*2 
** arrange/function 

SETT PRINT* CN 
SETT HEADING ON 

SCREEN 1 TYPE. 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS ' FOOT "Helvetica 1 , 268 COLOR 0 
?■'"'• 
'? • BINDING PROTEINS' 

? • 

SCREEN 1 TYPE 0 HEADING "Screen If AT 40', 2 SIZE 286,492 PIXELS FOOT "Helvetica" ,265 COLOR 0 
? •surface molecules end receptors i 1 . 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva* ,7 COLOR 0-0,0, 
lifit OFF fields nurriber,D/F;Z,H,ENraY,S,iroCRIPTO^ FOR R='B' 

• • 

•SCREEN 1 TYPE 0 HEADING "Screen 1* AT 40,2 SIZE 286,492 PIXELS .FOOT "Helvetica" ,265 COLOR 0 
? 'Calcium-binding proteins:' 

SCREEN 1 TOE 0 HEADING "Screen 1" AT 40,2'SIZE 286,492 PIXELS FONT B Geneva\7 COLOR 0,0,0, 

list OFF fields number, D, F,Z,R, ENTRY, S, DESCRIPTOR ,BGFREQ,RFEND, RATIO, I FOR Ra'C 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica" ,265 COLOR 0 
7 'Liganda 'and effectors i ! 

SCREEN 1 TYPE 0 KEtoCNG "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva", 7 COLOR 0,0,0, 
list OFF fields number t D, F, Z , R, ENTRY, s , DESCRIPTOR, BGFREQ , WEND^ RATIO , I FOR R='S* 

SCREEN 1 WPE 0 HEADING "Screen 1' AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica" ,265 COLOR 0 
7 'Other binding proteins:' 

SCREEN 1 TYFE-0 HEADING "Screen 1' AT'40,2 SIZE 286,492 PIXELS FCNT "Geneva", 7 COLOR 0,0,0, 

list OFF fields'nuniber,D,F f Z,R,QmiY,S,I^ FOR RsT ' 

7 . ■ . ■ 

SCREEN 1 TYPE 0 HEADING 'Screen 1" AT. 40,2 SIZE 286,492 PIXELS FONT "Helvetica" ,268 COLOR 0 

? ' . ONCOGENES' 

7 . * 

SCREEN 1 TYPE 0 HEADING 1 Screen 1" AT 40 t 2 SIZE 286,492 PIXELS FONT ' "Helvetica" , 265 COLOR 0 
7 'General oncogenes! 1 r + 

SCREEN 1 TYPE 0, HEADING "Screen 1" AT .40,2 SIZE 286,492 PIXELS .FOOT "Geneva"*,? COLOR 0,0,0, 
list OFF fields number, D> F, Z,R, ENTRY, S, DESCRIPTOR, BGFREQ,RFEND, RATIO, I FOR R='0' 

■ 

* 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica* ,265 COLOR 0 
7 'GTP-binding proteins i 1 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva 1 ',? COLOR 0,0,0, 
list OF? fields number, D,F,Z,R, ENTRY, S, DESCRIPTOR, BSFREQ,RFEND f RATIO,! FOR Rs'G' 
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SCREEN" 1 TYPE 0 HEADING 'Screen !• AT 40,2 SIZE 206,492 PIXELS FOOT "Helvetica", 265 COLOR 0 
? 'Viral elements! • • 
SCREEW 1 T*PE 0 HEADINC? -Screen 1" AT. 40,2 SIZE 286,492 PIXELS FONT "difiW?^ , 7 C&&R % 8 fl 0* 
list OFF fields number , D,F, Z, R, ENTRY, S, DESCRIPTOR, BGFREQ, RFEND, RATIO, I FOR R= ; v 

SCREEN 1 TVPE 0 HEADING 'Screen 1- AT 40,2 SIZE 286',492 PIXELS FOOT 'Helvetica', 265 CdLOR 0 
? 'Kinases end Phosphatases!' . ■ u 

SCSEEN 1 TYPE 0 HEADING -Screen 1- AT 40,2 SIZE 286,492 PIXELS FONT 'Geneva', 7 COLORCO.O 
list OFF fields nuinber,D,F, Z, REENTRY, S, DESCRIPTOR, BGFREQ,RFE^ f RATIO, I FOR Rs'Y"' 

SCREEN l'TVPE 0 HEADING "Screen 1' AT 40,2 SIZE 286,492 PIXELS FGNT 'Helvetica"' 265 COLOR 0 
? 'Tumor-related wtigensi 1 ■ 

SCREEN 1 WPE 0 HEADING .'Screen 1' AT 40.2 SIZE 286,492 PIXELS FONT "Geneva-, 7 COLOR 0,0.0 

li'St OFF'fxelds n\^er,D,F,Z / R,ENTRV,s,i^cRIPTOR,BCFi^,RFEND,RATIO,I FOR R='A' 

?. 

SCREEN 1 TYPE 0 HEADING 'Screen 1' AT 40,2 SIZE 286,492 PIXELS FQMT 'helvetica' 268 COLOR 0 
?' . PROTEIN SYttXHETIC MACHINERY- PROTEINS ' ' 

SCREEN 1 TYPE 0 HEADIN3 "Screen 1" AT 40,2 SIZE 286,492 PIXEL9 FONT 'Helvetica' ,265 COLOR 0 
7 "Transcription and Nucleic Acid-binding proteins i ' * * w 

SCREEN 1 TYPE 0 HEADING 'Screen 1' AT 40,2 SIZE 286,492 PIXELS FONT 'Geneva", 7 COLOR OOO 
list OFF fields nun^,D,F f Z,R,a7IKY,SiraSCRIPTOR,BG^^ TOR R='D« ' 

* * * * 

SCR^^^PE^^ HEADING -Screen 1" AT 40,2 SIZE 286,492 PIXELS * FONT 'Helvetica' ,265 COLOR.O 

??R£EN1 TYPE 0 HEADING 'Screen 1- AT 40,2 SIZE 286,492 PIXELS FONT 'Geneva", 7 COLOR CLO.O." 
last OFF fields number, D,F,Z,R>ENrRY,S,DESCRIPTOR f BGFREQ,RFIiID,RATIO, I 50R R='T' 

SCREEN 1 TYPE 0 HEADING 'Screen 1' AT' 40,2 6IZE 286,492 PIXELS FOST ^Helvetica' ,265 COLOR 0 
? 'Rib osaTtal proteins:' . • * «««\ v 

?F!FSLS 5PL° }TJ ^ mG Screen 1' AT 40,2 SIZE 286,492 PIXELS FONT -Geneva\7 COLOR O.Oio, 

liflt off fields nun±>er,D;F,z,R,E^ma l s,^EBCM^ tor 

* 

SCREEN 1TYPE 0 KEADINS ."Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica"^ 65 COLOR 0 
? 'Protein processing: ' ' 

i??^ TF?J* 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva»' ( 7 COLOR-0,0,0, 

list OFF fields lIumber,D,F,Z,R,E^mlY,S / C^^ FOR rJl ; 

.SCREEN 1 TTOE 0 HEAOD3G "Screen 1' AT 40,.2 SIZE 286,492 PIXELS . FONT "Helvetica \ 268 *COtOR 0 

? ' ENZYMES' 
? 

SCREEN 1 TYPE 0 HEADING 'Screen. 1- AT 40,2 SIZE 286.492 PIXELS FONT "Helvetica' ,265 COLOR 0 
?• 'FexTOproteins t ' 

f?*F£,i ?*,J> HE ^ ING ^ , !'^ e S a -. 1 ' * T 40,2 SIZE 28S ' 492 VTXELS FOOT "Geneva ',7 COLOR 0,0.0. 
list OF? fields «uniber,D,P,2,R,SinTO,S,raSC»IWCR,8<SFREQ < RFnTO,RfkTI0,I FOR R= ! P' 

SCREEN 1 T5fPE 0 HEADING • Screen- 1* AT 40,2 SIZE 286,492 PIXELS FOOT 'Helvetica' ,265 COLOR 0 
? 'Pro teases and inhibitors;' . • • ' ••■ *> SJl * m v 

1 0 HEADING 'Screen 1- AT 40,3 SIZE 286,493 PIXELS TOOT 'Geneva', 7 COLOR 0,0.0, 

list 05? fields nuna^;0,F,Z,R,EOT8X,S,DKCMWC»,^ FOR rJ p . ' ' ' 

f^ditivfpnoSSglaSr. }' * T 40 ' 2 6IZB 28S ' W2 PEELS '^vetica'.jes COLOR 0 ' 

SCREEN 1 TYPE 0 HEADINto -Screen 1" AT 40,2 SIZE 286,492 PIXELS FOOT "Geneva 1 .? COLOR 0 0 0 
list OFF f ields number, D,F,Z,R,£OTRY, 6, DESCRIPTOR, BGFREQ,RFEND, RATIO, I FOR R='Z' * 

• • • . 

SCREEN 1 TYPE 0 HEADING' " Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica ",265 COLOR 0 
7 'Sugar metabolism: ' • ' • ^ u 

f?^™ ?FL 0 HEA ? IK0 "Screen 1" AT 40,2 SIZE 256,492 PIXELS FONT *Geneva\7 COLOR 0,0,0, 
list OFF fields nuitber,D # F,Z,R,E*?m, 9, descriptor for Rs»Q' . 

6CREEN*1 TYPE 0 HEADU50 "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica" ,265 COLOR 0 
7 'Amino acid metabolism: ' ' ^ 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva" ,7 COLOR 0,0,0/ 
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list OFF fields number, D, F, Z, R, ENTRY, S, DESCRIPTOR, BGFRBQ, RFEWD, RATIO,. I FOR R='M' 

SCREEN 1 TYPE 0, HEADING "Screen 1' AT 40,3 SIZE 286,492 PIXELS PCOT •llefvefici",^ 5$QR 0 
? 'Nucleic acid metabolism: •* . . , 

SCREEN 1 .TYPE 0* HEADING "80:6611*1* AT 40,2 SIZE 286,492 PIXELS FCOT "Geneva*,? COLOR 0,0,0/ 
list OFF 'fields number , D, F, Z, R, ENTRY, 6, DESCRIPTOR, BGFREQ, rfqjd, RATIO, -I for Ro»N» 

'SCREEN '1 TYPE 0 HEADING "Screen 1' AT 40,2 SIZE 286,492 PIXELS' FOOT 'Helvetica*, 2 65 COLOR 0 
? 'Lipid metabolism: ' • 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS PCOT "Geneva", 7 COLOR 0,0,0, 
list OFF fields number, D,F,Z,R,S5TRy,S, DESCRIPTOR, BGFREQ, RFEND, RATIO, I FOR Rs<W* 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FOOT "Helvetica", 265 COLOR 0 
? 'Other enzymes:' , * 

SCREEN 1 TYPE 0 HEADING 'Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva", 7 COLOR 0,0,0, 
lifit OFF fields number, D,F, Z f R,ENITCY,S # DESCRIPTOR, BGFRBQ, RFEND, RATIO, I FOR Rs'E 1 
? ■ • 

SCREEN 1 TOPS 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica", 268 COLOR 0 

?. *. • ' . 

? ' MISCELLANEOUS CATEGORIES' 

7 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FOOT "Helvetica" ,265 COLOR 0 
? 'Stress responser ' * * 
SCREEN -1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 266,492 PIXELS FONT "Geneve", 7 COLOR 0,0,0, 
list OFF fields nwiber,D,FvZ,R,EmY,S,r^RlPT0R,BGFI^ / RF^,miO,I FOR R='H' 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FOOT "Helvetica", 265 CGLOR'O 

? 'Structural: ■ • . • 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FOOT "Geneva", 7 COLOR 0,0,0, 

list OFF fields number , D, F , 2 , R, ENTRJf , DESCRIPTOR, BGFREQ, RFEND , RATIO , I '.FOR R='K' 

• * * « * 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40i2 SIZE 286,492 PIXELS FOOT -Helvetica", 265 COLOR -0 
?• 'Other clones:' * 

SCREEN 1 TYPE 0 READING "Screen 1" AT 40,2 SIZE 286,492 PIXELS * FONT" "Geneva", 7. COLOR 0,0.0. 
list OFF fields nuniber,D,F,Z,R,£KrRY,S,t^(^PrOR,BGFR£0,RF^,l^TIO,X FOR R='X' 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FOOT "Helvetica", 2 65 COLOR 0 
? ' Clones ■ of unknown function s ' • . 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FOOT "Geneva", 7 COLOR '0,0,0, 

list OFF fields number, d,f,z,r,eniw,s,descriptor,bgfreq,rjemd, ratio, i for r*'U' 

ENDCASE 
DO "Test print.prg" 
SET PRINT OFF 
SET DEVICE TO SCREEN 
CLOSE DATABASES 
ERASE TEMPLIB.DBF 
ERASE TEMPNUM.DBF 
ERASE TEMPDESIG.DBF 
SET MARGIN TO 0 
CLEAR 
LOOP 
ENDDO 
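
For readers unfamiliar with FoxBASE+, the compression and fusion routines of the subtraction program above appear to do the following: collapse clones that share an entry and designation into one record carrying a count, look each target record up in the background library, and compute a ratio of target count to background count. The Python sketch below is a reading of that logic under stated assumptions, not a translation of the listing. The function and gene names are hypothetical, the background count is taken as the largest count for a matching entry, and the final tie-break sorts on entry rather than descriptor.

```python
from collections import Counter


def compress(records):
    """Collapse clones sharing (entry, designation) into one row with a count."""
    return Counter(records)


def fuse(target, background):
    """Attach background counts to target rows and rank the rows by ratio."""
    background_by_entry = {}
    for (entry, _designation), count in background.items():
        # Assumption: use the largest count among matching background rows.
        background_by_entry[entry] = max(background_by_entry.get(entry, 0), count)
    rows = []
    for (entry, designation), count in target.items():
        found = background_by_entry.get(entry)
        bgfreq = found if found is not None else 0
        # An entry absent from the background is divided by 1/2, not 0.
        actual = found if found is not None else 0.5
        rows.append({"entry": entry, "d": designation, "rfend": count,
                     "bgfreq": bgfreq, "ratio": count / actual})
    rows.sort(key=lambda row: (-row["ratio"], -row["bgfreq"], row["entry"]))
    return rows


# Hypothetical clone records: (entry, designation).
target = compress([("ACTB", "E")] * 4 + [("IL1B", "E")] * 3)
background = compress([("ACTB", "E")] * 2)
table = fuse(target, background)
```

In this example the transcript missing from the background library ranks first, because dividing by one half doubles its count.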
* Northern (single), version 11-25-94 
CLOSE DATABASES 
SET TALK OFF 
SET PRINT OFF 
SET EXACT OFF 
CLEAR 
STORE ' ' TO Eobject 
STORE 0 TO Numb 
STORE 0 TO ZOQ 
STORE 1 TO Bail 
DO WHILE .T. 
* Program.: Northern (single).fmt 
* Date....: 8/ 8/94 
* Version.: FoxBASE+/Mac, revision 1.10 
* Notes...: Format file Northern (single) 



SCREEN 1 TYPE 0 HEADING 'Screen 1- AT 40,2 SIZE 286,492 PIXELS raff •fi fl r, PVS . 10 ™rr« 'a a * 
ft PIXELS 15,81 TO 46,397 STYliE 28447 COLOR 0, 07-1,-25600; fffS 6«wva-,.ia COI0R 0,0,0 

f 89 ' 79 TO 192 ' 422 28447 «*« 0,6,0 -25600 -1 -1 

! JH r ?S,% l " a S y #,i STYLE 65536 KOT 'Geneva -12 COLOR 0,0,0,-1 -1 -1 

I HI'iZ 3 GET 0 FONT -Geneva-, 12 SIZE is7l42 COLOR 0 0 0 -1 -1 -1 



S PIXELS 60,152 SAY "Enter any ONE of the following!" <5536 m &ta*; J COLOR -r, 

* EOF: Northern (single).fmt 
READ 
IF Bail=2 
CLEAR 
SCREEN 1 OFF 
RETURN 
ENDIF 

USE "SmartGuy:FoxBASE+/Mac:Fox files:Lookup.dbf" 
SET TALK ON 
IF Eobject<>' ' 
STORE UPPER(Eobject) TO Eobject 
SET SAFETY OFF 
SORT ON Entry TO "Lookup entry.dbf" 
SET SAFETY ON 
USE "Lookup entry.dbf" 
LOCATE FOR Look=Eobject 
IF .NOT. FOUND() 
CLEAR 
LOOP 
ENDIF 
BROWSE 
STORE Entry TO Searchval 
CLOSE DATABASES 
ERASE "Lookup entry.dbf" 
ENDIF 

IF Dobject<>' ' 
SET EXACT OFF 
SET SAFETY OFF 
SORT ON descriptor TO "Lookup descriptor.dbf" 
SET SAFETY ON 
USE "Lookup descriptor.dbf" 
LOCATE FOR UPPER(TRIM(descriptor))=UPPER(TRIM(Dobject)) 
IF .NOT. FOUND() 
CLEAR 
LOOP 
ENDIF 
BROWSE 
STORE Entry TO Searchval 
CLOSE DATABASES 
ERASE "Lookup descriptor.dbf" 
SET EXACT ON 
ENDIF 

IF Numb<>0 
USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf" 
GO Numb 
BROWSE 
STORE Entry TO Searchval 
ENDIF 
CLEAR 
? 'Northern analysis for entry ' 
?? Searchval 
? 
? 'Enter Y to proceed' 
WAIT TO OK 
CLEAR 
IF UPPER(OK)<>'Y' 
SCREEN 1 OFF 
RETURN 
ENDIF 

* COMPRESSION SUBROUTINE FOR Library.dbf 
? 'Compressing the Libraries file now' 
USE "SmartGuy:FoxBASE+/Mac:Fox files:libraries.dbf" 
SET SAFETY OFF 
SORT ON library TO "Compressed libraries.dbf" 
* FOR entered>0 
SET SAFETY ON 
USE "Compressed libraries.dbf" 
DELETE FOR entered=0 
PACK 
COUNT TO TOT 
MARK1 = 1 
SW2=0 
DO WHILE SW2=0 ROLL 
IF MARK1 >= TOT 
PACK 
SW2=1 
LOOP 
ENDIF 
GO MARK1 
STORE library TO TESTA 
SKIP 
STORE Library TO TESTB 
IF TESTA = TESTB 
DELETE 
ENDIF 
MARK1 = MARK1+1 
LOOP 
ENDDO ROLL 

* Northern analysis 
CLEAR 
? 'Doing the northern now.' 
SET TALK ON 
USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf" 
SET SAFETY OFF 
COPY TO "Hits.dbf" FOR entry=searchval 
SET SAFETY ON 






* MASTER ANALYSIS 3; VERSION 12-9-94 
* Master menu for analysis output 
CLOSE DATABASES 
SET TALK OFF 
SET SAFETY OFF 
CLEAR 
SET DEVICE TO SCREEN 
SET DEFAULT TO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:" 
USE "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf" 
GO TOP 
STORE NUMBER TO INITIATE 
GO BOTTOM 
STORE NUMBER TO TERMINATE 
STORE 0 TO ENTIRE 
STORE 0 TO CONDEN 
STORE 0 TO ANAL 
STORE 0 TO EMATCH 
STORE 0 TO HMATCH 
STORE 0 TO OMATCH 
STORE 0 TO IMATCH 
STORE 0 TO XMATCH 
STORE 0 TO PRINTON 
STORE 0 TO PTF 
DO WHILE .T. 
* Program.: Master analysis.fmt 
* Date....: 12/ 9/94 
* Version.: FoxBASE+/Mac, revision 1.10 
* Notes...: Format file Master analysis 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",9 COLOR 0,0,0, 
@ PIXELS 39,255 TO 277,430 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1 
@ PIXELS 75,120 TO 178,241 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1 
@ PIXELS 27,98 SAY "Customized Output Menu" STYLE 65536 FONT "Geneva",274 COLOR 0,0,-1,-1,-1 
@ PIXELS 45,54 GET conden STYLE 65536 FONT "Chicago",12 PICTURE "@*C Condensed format" SIZE 
@ PIXELS 54,261 GET anal STYLE 65536 FONT "Chicago",12 PICTURE "@*RV Sort/number;Sort/entry; 
@ PIXELS 117,126 GET EMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Exact" SIZE 15,62 CO 
@ PIXELS 135,126 GET HMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Homologous" SIZE 15,1 
@ PIXELS 153,126 GET OMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Other spc" SIZE 15,84 
@ PIXELS 90,152 SAY "Matches:" STYLE 65536 FONT "Geneva",268 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 63,54 GET PRINTON STYLE 65536 FONT "Chicago",12 PICTURE "@*C Include clone listing" 
@ PIXELS 171,126 GET Imatch STYLE 65536 FONT "Chicago",12 PICTURE "@*C Incyte" SIZE 15,65 CO 
@ PIXELS 252,146 GET initiate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 270,146 GET terminate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 234,134 SAY "Include clones " STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 270,125 SAY "->" STYLE 65536 FONT "Geneva",14 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 198,126 GET PTF STYLE 65536 FONT "Chicago",12 PICTURE "@*C Print to file" SIZE 15,9 
@ PIXELS 189,0 TO 257,120 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1 
@ PIXELS 209,8 SAY "Library selection" STYLE 65536 FONT "Geneva",266 COLOR 0,0,-1,-1,-1,-1 
@ PIXELS 227,18 GET ENTIRE STYLE 65536 FONT "Chicago",12 PICTURE "@*RV All;Selected" SIZE 16 

* EOF: Master analysis, fmt 
READ 

IF ANAL=9 
CLEAR 

CLOSE DATABASES 
ERASE TEMPMASTER.DBF

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
SET SAFETY ON 
SCREEN 1 OFF 

RETURN 
ENDIF 

clear

? INITIATE

? TERMINATE
? CONDEN
? ANAL
? ematch
? Hmatch
? Omatch

? IMATCH
SET TALK ON 

IF ENTIRE=2
USE "Unique libraries.dbf"

REPLACE ALL i WITH ' '

BROWSE FIELDS i,libname,library,total,entered AT 0,6
ENDIF

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"

*COPY TO TEMPNUM FOR NUMBER>=INITIATE.AND.NUMBER<=TERMINATE

*USE TEMPNUM

COPY STRUCTURE TO TEMPLIB
USE TEMPLIB
IF ENTIRE=1

APPEND FROM "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf"
ENDIF

IF ENTIRE=2
USE "Unique libraries.dbf"

COPY TO SELECTED FOR UPPER(i)='Y'
USE SELECTED

STORE RECCOUNT() TO STOPIT
MARK=1

DO WHILE .T. 
IF MARK>STOPIT 
CLEAR 
EXIT 

ENDIF 

USE SELECTED 
GO MARK 

STORE library TO THISONE
? 'COPYING '
?? THISONE
USE TEMPLIB

APPEND FROM "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf" FOR library=THISONE
STORE MARK+1 TO MARK

LOOP
ENDDO
ENDIF 

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"

COUNT TO STARTOT 

COPY STRUCTURE TO TEMPDESIG 

USE TEMPDESIG 

IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0

APPEND FROM TEMPLIB

ENDIF

IF Ematch=1

APPEND FROM TEMPLIB FOR D='E'
ENDIF

IF Hmatch=1

APPEND FROM TEMPLIB FOR D='H'
ENDIF

IF Omatch=1

APPEND FROM TEMPLIB FOR D='O'
ENDIF

IF Imatch=1

APPEND FROM TEMPLIB FOR D='I'.OR.D='X'.OR.D='N'
ENDIF

IF Xmatch=1

APPEND FROM TEMPLIB FOR D='X'

ENDIF

COUNT TO ANALTOT

set talk off
******************

DO CASE
CASE PTF=0

SET DEVICE TO PRINT

SET PRINT ON

EJECT

CASE PTF=1

SET ALTERNATE TO "Total function sort.txt"

*SET ALTERNATE TO "H and O function sort.txt"

*SET ALTERNATE TO "Shear Stress HUVEC 2:Abundance sort.txt"

*SET ALTERNATE TO "Shear Stress HUVEC 2:Abundance con.txt"

*SET ALTERNATE TO "Shear Stress HUVEC 2:Function sort.txt"

*SET ALTERNATE TO "Shear Stress HUVEC 2:Distribution sort.txt"

*SET ALTERNATE TO "Shear Stress HUVEC 1:Clone list.txt"

*SET ALTERNATE TO "Shear Stress HUVEC 2:Location sort.txt"

SET ALTERNATE ON

ENDCASE



IF PRINTON=1

@1,30 SAY "Database Subset Analysis" STYLE 65536 FONT "Geneva",274 COLOR 0,0,0,-1,-1,-1

ENDIF

?

?

?

? date()
?? ' '

?? TIME()

? 'Clone numbers '

?? STR(INITIATE,6,0)

?? ' through '

?? STR(TERMINATE,6,0)

? 'Libraries: '

IF ENTIRE=1

? 'All libraries'

ENDIF

IF ENTIRE=2
MARK=1
DO WHILE .T.
IF MARK>STOPIT
EXIT
ENDIF

USE SELECTED
GO MARK
? ' '
?? TRIM(libname)
STORE MARK+1 TO MARK
LOOP
ENDDO
ENDIF

? 'Designations: '

IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0

?? 'All'

ENDIF

IF Ematch=1
?? 'Exact, '

ENDIF

IF Hmatch=1

?? 'Human, '

ENDIF

IF Omatch=1

?? 'Other sp. '

ENDIF

IF Imatch=1
?? 'INCYTE'
ENDIF

IF Xmatch=1
?? 'EST'

ENDIF

IF CONDEN=1

? 'Condensed format analysis'

ENDIF

IF ANAL=1

? 'Sorted by NUMBER'

ENDIF

IF ANAL=2

? 'Sorted by ENTRY'

ENDIF

IF ANAL=3

? 'Arranged by ABUNDANCE'

ENDIF

IF ANAL=4

? 'Sorted by INTEREST'

ENDIF

IF ANAL=5

? 'Arranged by LOCATION'
ENDIF

IF ANAL=6

? 'Arranged by DISTRIBUTION'

ENDIF

IF ANAL=7

? 'Arranged by FUNCTION'
ENDIF

? 'Total clones represented: '

?? STR(STARTOT,6,0)

? 'Total clones analyzed: '

?? STR(ANALTOT,6,0)

?

? 'l = library d = designation f = distribution z = location r = function c = cer
? 

****************** 

USE TEMPDESIG

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
DO CASE
CASE ANAL=1

* sort/number

SET HEADING OFF
IF CONDEN=1

SORT TO TEMPI ON ENTRY,NUMBER
DO "COMPRESSION number.PRG"
ELSE

SORT TO TEMPI ON NUMBER
USE TEMPI

list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR

*list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,RFEND,INIT,I
CLOSE DATABASES
ERASE TEMPI.DBF
ENDIF

CASE ANAL=2

* sort/DESCRIPTOR
SET HEADING ON

*SORT TO TEMPI ON DESCRIPTOR,ENTRY,NUMBER/S for D='E'.OR.D='H'.OR.D='O'.OR.D='X'.OR.D='I'
*SORT TO TEMPI ON ENTRY,DESCRIPTOR,NUMBER/S for D='E'.OR.D='H'.OR.D='O'.OR.D='X'.OR.D='I'
SORT TO TEMPI ON ENTRY,START/S for D='E'.OR.D='H'.OR.D='O'.OR.D='X'.OR.D='I'
IF CONDEN=1

DO "COMPRESSION entry.PRG"
ELSE

USE TEMPI

list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,RFEND,INIT,I
CLOSE DATABASES
ERASE TEMPI.DBF
ENDIF

CASE ANAL=3

* sort by abundance
SET HEADING ON

SORT TO TEMPI ON ENTRY,NUMBER for D='E'.OR.D='H'.OR.D='O'.OR.D='X'.OR.D='I'

DO "Compression abundance.prg"

CASE ANAL=4

* sort/interest
SET HEADING ON
IF CONDEN=1

SORT TO TEMPI ON ENTRY,NUMBER FOR I>0
DO "COMPRESSION interest.PRG"
ELSE

SORT ON I/D,ENTRY TO TEMPI FOR I>1
USE TEMPI

list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,RFEND,INIT,I
CLOSE DATABASES
ERASE TEMPI.DBF
ENDIF

CASE ANAL=5

* arrange/location

SET HEADING ON
STORE 4 TO AMPLIFIER
? 'Nuclear:'

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression location.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Cytoplasmic: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression location.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Cytoskeleton: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression location.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Cell surface: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression location.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Intracellular membrane: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression location.prg"
ELSE

DO "Normal subroutine 1"

ENDIF

? 'Mitochondrial: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression location.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Secreted: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression location.prg"
ELSE

DO "Normal subroutine 1"

ENDIF

? 'Other: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression location.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Unknown:'

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression location.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

IF CONDEN=1

SET DEVICE TO PRINTER

SET PRINTER ON

EJECT

DO "Output heading.prg"
USE "Analysis location.dbf"
DO "Create bargraph.prg"
SET HEADING OFF

? ' FUNCTIONAL CLASS TOTAL UNIQUE NEW % TOTAL'

• . 

LIST OFF FIELDS Z,NAME,CLONES,GENES,NEW,PERCENT,GRAPH
CLOSE DATABASES
ERASE TEMP2.DBF
SET HEADING ON

*USE "SmartGuy:FoxBASE+/Mac:fox files:TEMPMASTER.dbf"
ENDIF 

CASE ANAL=6 

* arrange/distribution 
SET HEADING ON 

STORE 3 TO AMPLIFIER 

? 'Cell/tissue specific distribution: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression distrib.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Non-specific distribution: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression distrib.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Unknown distribution: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression distrib.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

IF CONDEN=1

SET DEVICE TO PRINTER

SET PRINTER ON
EJECT

DO "Output heading.prg"

USE "Analysis distribution.dbf"

DO "Create bargraph.prg"

SET HEADING OFF

? ' FUNCTIONAL CLASS TOTAL UNIQUE % TOTAL'

? ' 

LIST OFF FIELDS P,NAME,CLONES,GENES,PERCENT,GRAPH
CLOSE DATABASES
ERASE TEMP2.DBF
SET HEADING ON

*USE "SmartGuy:FoxBASE+/Mac:fox files:TEMPMASTER.dbf"
ENDIF

CASE ANAL=7 

* arrange/function

SET HEADING ON

STORE 10 TO AMPLIFIER

? ' BINDING PROTEINS'

?

? 'Surface molecules and receptors: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Calcium-binding proteins: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Ligands and effectors: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Other binding proteins: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"

ENDIF

*EJECT

? ' ONCOGENES'
? 'General oncogenes: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'GTP-binding proteins: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Viral elements: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Kinases and Phosphatases:'

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Tumor-related antigens: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"

ENDIF

*EJECT

? ' PROTEIN SYNTHETIC MACHINERY PROTEINS'

?

? 'Transcription and Nucleic Acid-binding proteins: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"

ENDIF

? 'Translation: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"

ENDIF

? 'Ribosomal proteins: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Protein processing: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"

ENDIF

*EJECT

? ' ENZYMES'
?

? 'Ferroproteins: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Proteases and inhibitors: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Oxidative phosphorylation: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Sugar metabolism: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Amino acid metabolism: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

? 'Nucleic acid metabolism: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"

ELSE

DO "Normal subroutine 1"
ENDIF

? 'Lipid metabolism: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"

ELSE

DO "Normal subroutine 1"
ENDIF

? 'Other enzymes: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"

ENDIF

*EJECT

? ' MISCELLANEOUS CATEGORIES'

? 'Stress response: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"

ELSE

DO "Normal subroutine 1"
ENDIF

? 'Structural: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"

ELSE

DO "Normal subroutine 1"
ENDIF

? 'Other clones: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Clones of unknown function: '

SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1

DO "Compression function.prg"
ELSE

DO "Normal subroutine 1"
ENDIF

IF C0NDEN=1 

EJECT 

*SET DEVICE TO PRINTER
*SET PRINT ON

DO "Output heading.prg"

USE "Analysis function.dbf"

DO "Create bargraph.prg"

SET HEADING OFF
***

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",12 COLOR 0,0,0

? ' TOTAL TOTAL NEW DIST

? ' FUNCTIONAL CLASS CLONES GENES GENES FUNCTIONAL CLASS'

*LIST OFF FIELDS P,NAME,CLONES,GENES,NEW,PERCENT,GRAPH,COMPANY
LIST OFF FIELDS P,NAME,CLONES,GENES,NEW,PERCENT,GRAPH
CLOSE DATABASES
ERASE TEMP2.DBF
SET HEADING ON

*USE "SmartGuy:FoxBASE+/Mac:fox files:TEMPMASTER.dbf"
ENDIF

CASE ANAL=8

DO "Subgroup summary 3.prg"
ENDCASE

DO "Test print.prg"
SET PRINT OFF
SET DEVICE TO SCREEN
CLOSE DATABASES
*ERASE TEMPLIB.DBF
*ERASE TEMPNUM.DBF

* ERASE TEMPDESIG.DBF

* ERASE SELECTED. DBF 
CLEAR 

LOOP 
ENDDO 
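
The designation filter in the master analysis program (the APPEND FROM TEMPLIB FOR D=... steps) can be restated outside FoxBASE. The following is only a minimal Python sketch under stated assumptions: the record layout (dicts with a "d" key), the function name and the keyword arguments are all hypothetical, and it uses a set of wanted codes rather than repeated appends, so a record matched by two boxes appears once instead of twice.

```python
def select_by_designation(records, exact=False, homologous=False,
                          other=False, incyte=False, est=False):
    """Return the records whose designation code "d" matches the chosen boxes.

    Hypothetical helper: names and record layout are assumptions made
    for illustration, not part of the original listing.
    """
    # As in the listing, leaving E/H/O/Incyte all unchecked selects everything.
    if not (exact or homologous or other or incyte):
        return list(records)
    wanted = set()
    if exact:
        wanted.add("E")
    if homologous:
        wanted.add("H")
    if other:
        wanted.add("O")
    if incyte:
        wanted.update({"I", "X", "N"})  # the Incyte box also pulls X and N
    if est:
        wanted.add("X")
    return [r for r in records if r["d"] in wanted]


sample = [
    {"number": 1, "d": "E"},
    {"number": 2, "d": "H"},
    {"number": 3, "d": "X"},
    {"number": 4, "d": "N"},
]
picked = select_by_designation(sample, exact=True, incyte=True)
print([r["number"] for r in picked])
```
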
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS

USE TEMPI

COUNT TO TOT

REPLACE ALL RFEND WITH 1

MARK1 = 1

SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK

COUNT TO UNIQUE

COUNT TO NEWGENES FOR D='H'.OR.D='O'

SW2=1

LOOP

ENDIF

GO MARK1
DUP = 1

STORE ENTRY TO TESTA

SW = 0

DO WHILE SW=0 TEST
SKIP

STORE ENTRY TO TESTB

IF TESTA = TESTB

DELETE

DUP = DUP+1

LOOP

ENDIF
GO MARK1

REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP

ENDDO TEST
LOOP

ENDDO ROLL
GO TOP

STORE Z TO LOC

USE "Analysis location.dbf"

LOCATE FOR Z=LOC

REPLACE CLONES WITH TOT

REPLACE GENES WITH UNIQUE

REPLACE NEW WITH NEWGENES

USE TEMPI

SORT ON RFEND/D TO TEMP2

USE TEMP2

?? STR(UNIQUE,5,0)

?? ' genes, for a total of '

?? STR(TOT,5,0)

?? ' clones'

? ' V Coincidence'

list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I

*SET PRINT OFF
CLOSE DATABASES
ERASE TEMPI.DBF
ERASE TEMP2.DBF
USE TEMPDESIG
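
The compression subroutines all follow one pattern: with the clones sorted by ENTRY, each run of identical entries is collapsed to its first record and the run length is stored in RFEND. The following is a minimal Python sketch of that idea, not a translation of the listing: the record layout (dicts with "number" and "entry" keys) and the function name are assumptions made for illustration, and `itertools.groupby` stands in for the nested DO WHILE loops.

```python
from itertools import groupby


def compress(records):
    """Collapse runs of equal "entry" values into one record per entry.

    Hypothetical helper: each record is a dict with at least an "entry"
    key. The returned records carry "rfend", the number of clones that
    shared the entry.
    """
    ordered = sorted(records, key=lambda r: r["entry"])
    compressed = []
    for _, run in groupby(ordered, key=lambda r: r["entry"]):
        run = list(run)
        first = dict(run[0])       # keep the first clone of the run
        first["rfend"] = len(run)  # count of clones with this entry
        compressed.append(first)
    # Most abundant entries first, mirroring SORT ON RFEND/D.
    compressed.sort(key=lambda r: r["rfend"], reverse=True)
    return compressed


clones = [
    {"number": 1, "entry": "A"},
    {"number": 2, "entry": "B"},
    {"number": 3, "entry": "A"},
    {"number": 4, "entry": "C"},
    {"number": 5, "entry": "A"},
]
result = compress(clones)
unique, total = len(result), len(clones)
print(unique, "genes, for a total of", total, "clones")
```

Here `unique` and `total` play the roles of UNIQUE and TOT in the listing.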
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMPI

COUNT TO TOT

REPLACE ALL RFEND WITH 1

MARK1 = 1

SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK

COUNT TO UNIQUE

SW2=1

LOOP

ENDIF
GO MARK1
DUP = 1

STORE ENTRY TO TESTA

SW = 0

DO WHILE SW=0 TEST
SKIP

STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF

GO MARK1

REPLACE RFEND WITH DUP

MARK1 = MARK1+DUP

SW=1

LOOP

ENDDO TEST

LOOP

ENDDO ROLL

* BROWSE 

*SET PRINTER ON

SORT ON DATE TO TEMP2

USE TEMP2

?? STR(UNIQUE,4,0)

?? ' genes, for a total of '

?? STR(TOT,4,0)

?? ' clones'

?

? ' V Coincidence'

COUNT TO P4 FOR I=4

IF P4>0

? STR(P4,3,0)

?? ' genes with priority = 4 (Secondary analysis:)'

list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=4
?

ENDIF

COUNT TO P3 FOR I=3

IF P3>0

? STR(P3,3,0)

?? ' genes with priority = 3 (Full insert sequence:)'

list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=3
?

ENDIF

COUNT TO P2 FOR I=2

IF P2>0

? STR(P2,3,0)

?? ' genes with priority = 2 (Primary analysis complete:)'

list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=2
?

ENDIF

COUNT TO P1 FOR I=1
IF P1>0
? STR(P1,3,0)

?? ' genes with priority = 1 (Primary analysis needed:)'

list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=1
ENDIF

*SET PRINT OFF
CLOSE DATABASES
ERASE TEMPI.DBF
ERASE TEMP2.DBF

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS

USE TEMPI

COUNT TO TOT

REPLACE ALL RFEND WITH 1

MARK1 = 1

SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK

COUNT TO UNIQUE

SW2=1

LOOP

ENDIF
GO MARK1
DUP = 1

STORE ENTRY TO TESTA
SW = 0

DO WHILE SW=0 TEST
SKIP

STORE ENTRY TO TESTB

IF TESTA = TESTB

DELETE

DUP = DUP+1

LOOP

ENDIF
GO MARK1

REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP

ENDDO TEST
LOOP

ENDDO ROLL
* BROWSE

*SET PRINTER ON

SORT ON NUMBER TO TEMP2

USE TEMP2

?? STR(UNIQUE,4,0)

?? ' genes, for a total of '

?? STR(TOT,5,0)

?? ' clones'

? ' V Coincidence'

list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I

*SET PRINT OFF
CLOSE DATABASES
ERASE TEMPI.DBF
ERASE TEMP2.DBF

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMPI

COUNT TO TOT

REPLACE ALL RFEND WITH 1

MARK1 = 1

SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK

COUNT TO UNIQUE

COUNT TO NEWGENES FOR D='H'.OR.D='O'

SW2=1

LOOP

ENDIF
GO MARK1
DUP = 1

STORE ENTRY TO TESTA

SW = 0

DO WHILE SW=0 TEST
SKIP

STORE ENTRY TO TESTB

IF TESTA = TESTB

DELETE

DUP = DUP+1

LOOP

ENDIF
GO MARK1

REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP

ENDDO TEST
LOOP

ENDDO ROLL
GO TOP

STORE R TO FUNC
USE "Analysis function.dbf"
LOCATE FOR P=FUNC
REPLACE CLONES WITH TOT
REPLACE GENES WITH UNIQUE
REPLACE NEW WITH NEWGENES
USE TEMPI

SORT ON RFEND/D TO TEMP2

USE TEMP2

SET HEADING ON

?? STR(UNIQUE,5,0)

?? ' genes, for a total of '

?? STR(TOT,5,0)

?? ' clones'
***

? ' V Coincidence'

list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I

***

* SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",12 COLOR 0,0,
*list off fields RFEND,S,DESCRIPTOR

*SET PRINT OFF
CLOSE DATABASES
ERASE TEMPI.DBF
ERASE TEMP2.DBF
USE TEMPDESIG
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMPI

COUNT TO TOT

REPLACE ALL RFEND WITH 1

MARK1 = 1

SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK

COUNT TO UNIQUE

SW2=1

LOOP

ENDIF
GO MARK1
DUP = 1

STORE ENTRY TO TESTA

SW = 0

DO WHILE SW=0 TEST
SKIP

STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF
GO MARK1

REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP

ENDDO TEST
LOOP

ENDDO ROLL
GO TOP

STORE F TO DIST

USE "Analysis distribution.dbf"
LOCATE FOR P=DIST
REPLACE CLONES WITH TOT
REPLACE GENES WITH UNIQUE
USE TEMPI

SORT ON RFEND/D TO TEMP2

USE TEMP2

?? STR(UNIQUE,5,0)

?? ' genes, for a total of '

?? STR(TOT,5,0)

?? ' clones'

? ' V Coincidence'

list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I

*SET PRINT OFF
CLOSE DATABASES
ERASE TEMPI.DBF
ERASE TEMP2.DBF
USE TEMPDESIG
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS

USE TEMPI

COUNT TO TOT

REPLACE ALL RFEND WITH 1

MARK1 = 1

SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK

COUNT TO UNIQUE

SW2=1

LOOP

ENDIF
GO MARK1
DUP = 1

STORE ENTRY TO TESTA
SW = 0

DO WHILE SW=0 TEST
SKIP

STORE ENTRY TO TESTB

IF TESTA = TESTB

DELETE

DUP = DUP+1

LOOP

ENDIF
GO MARK1

REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP

ENDDO TEST
LOOP

ENDDO ROLL

GO TOP

USE TEMPI

?? STR(UNIQUE,5,0)

?? ' genes, for a total of '

?? STR(TOT,5,0)

?? ' clones'

? ' V Coincidence'

list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S

*SET PRINT OFF
CLOSE DATABASES
ERASE TEMPI.DBF
USE TEMPDESIG
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
copy TO TEMPI FOR
USE TEMPI

COUNT TO IDGENE FOR D='E'.OR.D='O'.OR.D='H'.OR.D='N'.OR.D='R'.OR.D='A'

DELETE FOR D='N'.OR.D='D'.OR.D='A'.OR.D='U'.OR.D='S'.OR.D='M'.OR.D='R'.OR.D='V'

COUNT TO TOT

REPLACE ALL RFEND WITH 1

MARK1 = 1

SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK

COUNT TO UNIQUE

SW2=1

LOOP

ENDIF
GO MARK1
DUP = 1

STORE ENTRY TO TESTA
SW = 0

DO WHILE SW=0 TEST
SKIP

STORE ENTRY TO TESTB

IF TESTA = TESTB

DELETE

DUP = DUP+1

LOOP

ENDIF
GO MARK1

REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1

LOOP

ENDDO TEST
LOOP

ENDDO ROLL
* BROWSE

*SET PRINTER ON

SORT ON RFEND/D,NUMBER TO TEMP2
USE TEMP2

REPLACE ALL START WITH RFEND/IDGENE*10000

?? STR(UNIQUE,5,0)

?? ' genes, for a total of '

?? STR(TOT,5,0)
?? ' clones'

? ' Coincidence V V Clones/10000'

set heading off

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0

CLOSE DATABASES
ERASE TEMPI.DBF
ERASE TEMP2.DBF

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMPI

COUNT TO IDGENE FOR D='E'.OR.D='O'.OR.D='H'.OR.D='N'.OR.D='R'.OR.D='A'

DELETE FOR D='N'.OR.D='D'.OR.D='A'.OR.D='U'.OR.D='S'.OR.D='M'.OR.D='R'.OR.D='V'

COUNT TO TOT

REPLACE ALL RFEND WITH 1

MARK1 = 1

SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK

COUNT TO UNIQUE

SW2=1

LOOP

ENDIF
GO MARK1
DUP = 1

STORE ENTRY TO TESTA
SW = 0

DO WHILE SW=0 TEST
SKIP

STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF

GO MARK1

REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1

LOOP

ENDDO TEST
LOOP

ENDDO ROLL
*BROWSE

*SET PRINTER ON

SORT ON RFEND/D,NUMBER TO TEMP2
USE TEMP2

REPLACE ALL START WITH RFEND/IDGENE*10000

?? STR(UNIQUE,5,0)

?? ' genes, for a total of '

?? STR(TOT,5,0)

?? ' clones'

? ' Coincidence V V Clones/10000'

set heading off

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,

list fields number,RFEND,START,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,INIT,I

*SET PRINT OFF

CLOSE DATABASES

ERASE TEMPI.DBF

ERASE TEMP2.DBF

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
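
The abundance subroutine rescales each entry's clone count to a per-10,000 figure with REPLACE ALL START WITH RFEND/IDGENE*10000. A minimal Python sketch of that arithmetic follows; the function name and the sample numbers are hypothetical, and it assumes the expression is evaluated left to right, that is, as (RFEND/IDGENE)*10000.

```python
import math


def abundance_per_10000(rfend, idgene):
    """Scale a clone count to occurrences per 10,000 identified clones.

    rfend:  number of clones sharing one entry (RFEND in the listing)
    idgene: number of clones counted into IDGENE (must be non-zero)
    """
    if idgene == 0:
        raise ValueError("idgene must be non-zero")
    return rfend / idgene * 10000


# Hypothetical figures: 3 clones of one gene among 1,500 identified clones.
value = abundance_per_10000(3, 1500)
print(round(value, 2))
```
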
USE TEMPI
COUNT TO TOT

?? ' Total of '

?? STR(TOT,4,0)
?? ' clones'
?

*list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR,LENGTH,RFEND,INIT,I
list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR
CLOSE DATABASES
ERASE TEMPI.DBF
USE TEMPDESIG
* Lifeseq menu; version 8-7-94

SET TALK OFF

set device to screen

CLEAR

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
STORE LUPDATE() TO Update
GO BOTTOM

STORE RECNO() TO cloneno
STORE 6 TO Chooser
DO WHILE .T.

* Program.: Lifeseq menu.fmt

* Date....: 1/11/95

* Version.: FoxBASE+/Mac, revision 1.10

* Notes...: Format file Lifeseq menu

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",268 COLOR 0,0,
@ PIXELS 18,126 TO 77,365 STYLE 28479 COLOR 32767,-25600,-1,-16223,-16721,-15725
@ PIXELS 110,29 TO 188,217 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1

@ PIXELS 45,161 SAY "LIFESEQ" STYLE 65536 FONT "Geneva",536 COLOR 0,0,-1,-1,7135,5884

@ PIXELS 36,269 SAY "TO" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,7135,5884

@ PIXELS 63,143 SAY "Molecular Biology Desktop" STYLE 65536 FONT "Helvetica",18 COLOR 0,0,0,

@ PIXELS 90,252 TO 251,467 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1

@ PIXELS 117,270 GET Chooser STYLE 65536 FONT "Chicago",12 PICTURE "@*RV Transcript profiles
@ PIXELS 135,128 SAY Update STYLE 0 FONT "Geneva",12 SIZE 15,79 COLOR 0,0,0,-25600,-1,-1
@ PIXELS 171,128 SAY cloneno STYLE 0 FONT "Geneva",12 SIZE 15,79 COLOR 0,0,0,-25600,-1,-1
@ PIXELS 135,44 SAY "Last update:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 171,44 SAY "Total clones:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 45,296 SAY "v1.30" STYLE 65536 FONT "Geneva",782 COLOR 0,0,-1,-1,-1,-1

* EOF: Lifeseq menu.fmt
READ 

DO CASE 

CASE Chooser=1

DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:Master analysis 3.prg"
CASE Chooser=2

DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:Subtraction 2.prg"
CASE Chooser=3

DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:Northern (single).prg"

CASE Chooser=4

USE "Libraries.dbf"

BROWSE

CASE Chooser=5

DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:See individual clone.prg"

CASE Chooser=6

DO "SmartGuy:FoxBASE+/Mac:fox files:Libraries:Output programs:Menu.prg"

CASE Chooser=7

CLEAR

SCREEN 1 OFF

RETURN

ENDCASE

LOOP 






@1,30 SAY "Database Subset Analysis" STYLE 65536 FONT "Geneva",274 COLOR 0,0,0,-1,-1,-1

?

?

?

? date()

?? ' '

?? TIME()

? 'Clone numbers '
?? STR(INITIATE,6,0)
?? ' through '
?? STR(TERMINATE,6,0)
? 'Libraries: '

IF ENTIRE=1

? 'All libraries'

ENDIF

IF ENTIRE=2
MARK=1
DO WHILE .T.
IF MARK>STOPIT
EXIT
ENDIF

USE SELECTED
GO MARK
? ' '

?? TRIM(libname)
STORE MARK+1 TO MARK
LOOP
ENDDO
ENDIF

? 'Designations: '

IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0

?? 'All'

ENDIF

IF Ematch=1
?? 'Exact, '

ENDIF

IF Hmatch=1
?? 'Human, '
ENDIF

IF Omatch=1
?? 'Other sp. '
ENDIF

IF CONDEN=1

? 'Condensed format analysis'

ENDIF

IF ANAL=1

? 'Sorted by NUMBER'

ENDIF

IF ANAL=2

? 'Sorted by ENTRY'

ENDIF

IF ANAL=3

? 'Arranged by ABUNDANCE'

ENDIF

IF ANAL=4

? 'Sorted by INTEREST'

ENDIF

IF ANAL=5

? 'Arranged by LOCATION'

ENDIF

IF ANAL=6

? 'Arranged by DISTRIBUTION'

ENDIF

IF ANAL=7

? 'Arranged by FUNCTION'

ENDIF

? 'Total clones represented: '
?? STR(STARTOT,6,0)

? 'Total clones analyzed: '

?? STR(ANALTOT,6,0)

?

?
USE TEMPI
COUNT TO TOT
?? ' Total of '
?? STR(TOT,4,0)
?? ' clones'
?

*list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR

list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR

CLOSE DATABASES

ERASE TEMPI.DBF

USE TEMPDESIG






USE TEMPI 
COUNT TO TOT

?? ' Total of '
?? STR(TOT,4,0)
?? ' clones'
?

*list off fields number,L,D,F,Z,R,C,ENTRY

list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR

CLOSE DATABASES

ERASE TEMPI.DBF

USE TEMPDESIG 






* Northern (single), version 11-25-94

close databases

SET TALK OFF

SET PRINT OFF

SET EXACT OFF

CLEAR

STORE ' ' to Eobject

STORE ' ' TO Dobject

STORE 0 TO Numb 
STORE 0 TO Zog 
STORE 1 TO Bail 
DO WHILE .T. 

* Program.: Northern (single).fmt

* Date....: 8/ 8/94

* Version.: FoxBASE+/Mac, revision 1.10

* Notes...: Format file Northern (single)

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",12 COLOR 0,0,0

@ PIXELS 15,81 TO 46,397 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1

@ PIXELS 89,79 TO 192,422 STYLE 28447 COLOR 0,0,0,-25600,-1,-1

@ PIXELS 115,98 SAY "Entry #:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,0,-1,-1,-1

@ PIXELS 115,173 GET Eobject STYLE 0 FONT "Geneva",12 SIZE 15,142 COLOR 0,0,0,-1,-1,-1

@ PIXELS 145,89 SAY "Description" STYLE 65536 FONT "Geneva",12 COLOR 0,0,0,-1,-1,-1

| ™f ^fto 7 LS^ b ? eC 5 !™ ° «» •Qmeva-.rj SIZE 15,241 COLOR 0,0,0,-1,-1,-1 

@ PIXELS 35,89 SAY "Single Northern search screen" STYLE 65536 FONT "Geneva",274 COLOR 0,0,-

I l 1 ^ 3 2 l*t'll 2 GET Bail STYLE 6553 6 ™K 'ChicagoM2 PICTURE -3*R Continue; Bail out' SIZE 

@ PIXELS 175,98 SAY "Clone #:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,0,-1,-1,-1

@ PIXELS 175,173 GET Numb STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,0,-1,-1,-1

@ PIXELS 80,152 SAY "Enter any ONE of the following:" STYLE 65536 FONT "Geneva",12 COLOR -1,

* EOF: Northern (single).fmt

READ 

IF Bail=2
CLEAR

screen 1 off
RETURN
ENDIF

USE "SmartGuy:FoxBASE+/Mac:Fox files:Lookup.dbf"
SET TALK ON

IF Eobject<>' '

STORE UPPER(Eobject) to Eobject

SET SAFETY OFF

SORT ON Entry TO "Lookup entry.dbf"
SET SAFETY ON

USE "Lookup entry.dbf"

LOCATE FOR Look=Eobject

IF .NOT. FOUND () 

CLEAR 

LOOP 

ENDIF 

BROWSE 

STORE Entry TO Searchval 

CLOSE DATABASES 

ERASE "Lookup entry.dbf"

ENDIF 

IF Dobject<>' '
SET EXACT OFF
SET SAFETY OFF

SORT ON descriptor TO "Lookup descriptor.dbf"

SET SAFETY ON

USE "Lookup descriptor.dbf"

LOCATE FOR UPPER(TRIM(descriptor))=UPPER(TRIM(Dobject))

IF .NOT. FOUND()

CLEAR
LOOP

ENDIF

BROWSE

STORE Entry TO Searchval

CLOSE DATABASES 

ERASE "Lookup descriptor.dbf"

SET EXACT ON 

ENDIF 

IF Numb<>0

USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"

GO Numb

BROWSE 

STORE Entry TO Searchval 
ENDIF 

CLEAR 

? 'Northern analysis for entry '
?? Searchval

?

?

? 'Enter Y to proceed'

WAIT TO OK

CLEAR

IF UPPER(OK)<>'Y'
screen 1 off
RETURN
ENDIF

* COMPRESSION SUBROUTINE FOR Library.dbf
? 'Compressing the Libraries file now...'

USE 'SmartGuy:FoxBASE+/Mac:Fox files:libraries.dbf'
SET SAFETY OFF

SORT ON library TO "Compressed libraries.dbf"

* FOR entered>0

SET SAFETY ON

USE "Compressed libraries.dbf"

DELETE FOR entered=0

PACK

COUNT TO TOT
MARK1=1
SW2=0

DO WHILE SW2=0 ROLL

IF MARK1 >= TOT

PACK

SW2=1

LOOP

ENDIF
GO MARK1

STORE library TO TESTA
SKIP

STORE Library TO TESTB
IF TESTA = TESTB
DELETE
ENDIF

MARK1 = MARK1+1
LOOP 

ENDDO ROLL 

* Northern analysis 
CLEAR 

? 'Doing the northern now...'
SET TALK ON

USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"
SET SAFETY OFF

COPY TO "Hits.dbf" FOR entry=searchval
SET SAFETY ON
CLOSE DATABASES
SELECT 1

USE "Compressed libraries.dbf"

STORE RECCOUNT() TO Entries

SELECT 2

USE "Hits.dbf"

Mark=1

DO WHILE .T.

SELECT 1

IF Mark>Entries 

EXIT 
ENDIF 
GO MARK 

STORE library TO Jigger 
SELECT 2 

COUNT TO Zog FOR library=Jigger
SELECT 1

REPLACE hits WITH Zog

Mark=Mark+1

LOOP

ENDDO

SELECT 1 

BROWSE FIELDS LIBRARY, LIBNAME, ENTERED, HITS AT 0,0 
CLEAR 

? 'Enter Y to print:'

WAIT TO PRINSET

IF UPPER(PRINSET)='Y'

SET PRINT ON

CLEAR

EJECT

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",14 COLOR 0,0,0
? 'DATABASE ENTRIES MATCHING ENTRY '
?? Searchval
? DATE()

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,

LIST OFF FIELDS library, libname, entered, hits

?

?

SELECT 2

LIST OFF FIELDS NUMBER, LIBRARY, D, S, F, Z, R, ENTRY, DESCRIPTOR, RFSTART, START, RFEND
SET TALK OFF 
SET PRINT OFF 
ENDIF 

CLOSE DATABASES 
SET TALK OFF 
CLEAR 

DO "Test print.prg"
RETURN 
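
The two main stages of the listing above (compressing the libraries file, then counting hits per library for the selected entry) can be sketched in Python as follows. This is only an illustrative sketch, not the program of the listing: the function name, the dictionary-based records, and the use of exact string equality for the entry match are assumptions made for the example.

```python
from collections import Counter


def electronic_northern(libraries, clones, search_entry):
    """Count, per library, the clones whose entry equals search_entry.

    libraries: iterable of dicts with keys 'library', 'libname', 'entered'
    clones:    iterable of dicts with keys 'library', 'entry'
    (Both record layouts are hypothetical stand-ins for the .dbf files.)
    """
    # Compression step: skip libraries with no entered clones and keep
    # only the first row seen for each library code.
    compressed = {}
    for row in libraries:
        if row["entered"] > 0 and row["library"] not in compressed:
            compressed[row["library"]] = dict(row)

    # Northern step: select the matching clones and tally them by library.
    hits = Counter(c["library"] for c in clones if c["entry"] == search_entry)

    # One output row per library, in library-code order, with its hit count.
    return [
        {**compressed[code], "hits": hits.get(code, 0)}
        for code in sorted(compressed)
    ]
```

Calling `electronic_northern(libraries, clones, "HUMEF1B")` returns one row per distinct non-empty library, each carrying a `hits` field, which corresponds to the `BROWSE FIELDS LIBRARY, LIBNAME, ENTERED, HITS` view in the listing.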
TABLE 6



library
ADENINB01
ADRENOR01
ADRENOT01
AMLBNOT01
BMARNOT01
BMARNOT02
CARDNOT01
CHAONOT01
CORNNOT01
FIBRAGT01
FIBRAGT02
FIBRANT01
FIBRNGT01
FIBRNGT02
FIBRNOT01
HMC1NOT01
HUVELPB01
HUVENOB01
HUVESTB01
HYPONOB01
KIDNNOT01
LIVRNOT01
LUNGNOT01
MUSCNOT01
OVIDNOB01
PANCNOT01
PITUNOR01
PITUNOT01
PLACNOB01
SINTNOT02
SPLNFET01
SPLNNOT02
STOMNOT01
SYNORAB01
TBLYNOT01
TESTNOT01
THP1NOB01
THP1PEB01
THP1PLB01
U937NOT01

libname
Inflamed adenoid
Adrenal gland (r)
Adrenal gland (T)
AML blast cells (T)
Bone marrow
Bone marrow (T)
Cardiac muscle (V)
Chin. hamster ovary
Corneal stroma
Fibroblast, AT 5
Fibroblast, AT 30
Fibroblast, AT
Fibroblast, uv 5
Fibroblast, uv 30
Fibroblast
Fibroblast, normal
Mast cell line HMC-1
HUVEC IFN,TNF,LPS
HUVEC control
HUVEC shear stress
Hypothalamus
Kidney (T)
Liver (T)
Lung (T)
Skeletal muscle (T)
Oviduct
Pancreas, normal
Pituitary (r)
Pituitary (T)
Placenta
Small intestine (T)
Spleen+liver, fetal
Spleen (T)
Stomach
Rheum. synovium
T+B lymphoblast
Testis (T)
THP-1 control
THP phorbol
THP-1 phorbol LPS
U937, monocytic leuk



number  library    d  s  f  z  r  entry    descriptor                rfstart  start  rfend
2304    U937NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        0      773
3240    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        370    773
8269    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        371    773
4693    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        470    773
8959    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        327    773
9139    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        375    773
WHAT IS CLAIMED IS:

1. A method of analyzing a specimen containing gene
transcripts, said method comprising the steps of:

(a) producing a library of biological sequences;

(b) generating a set of transcript sequences, where
each of the transcript sequences in said set is indicative
of a different one of the biological sequences of the
library;

(c) processing the transcript sequences in a
programmed computer in which a database of reference
transcript sequences indicative of reference biological
sequences is stored, to generate an identified sequence
value for each of the transcript sequences, where each said
identified sequence value is indicative of a sequence
annotation and a degree of match between one of the
transcript sequences and at least one of the reference
transcript sequences; and

(d) processing each said identified sequence value to
generate final data values indicative of a number of times
each identified sequence value is present in the library.
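
Step (d) of claim 1 amounts to a tally of identified sequence values. A minimal Python sketch follows, assuming each identified sequence value is represented as an (annotation, degree of match) pair; the `IdentifiedSequence` type and the `tally_library` name are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class IdentifiedSequence(NamedTuple):
    annotation: str  # e.g. the name of the matched reference entry
    match: str       # degree of match, e.g. "exact" or "non-exact"


def tally_library(identified):
    """Return how many times each identified sequence value occurs."""
    # Counter groups equal (annotation, match) pairs and counts them.
    return Counter(identified)
```

Because the degree of match is part of the key, exact and non-exact matches to the same annotation are counted separately in this sketch.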

2. The method of claim 1, wherein step (a) includes
the steps of:

obtaining a mixture of mRNA;

making cDNA copies of the mRNA;

isolating a representative population of clones
transfected with the cDNA and producing therefrom the
library of biological sequences.

3. The method of claim 1, wherein the biological 
sequences are cDNA sequences. 

4. The method of claim 1, wherein the biological
sequences are RNA sequences.

5. The method of claim 1, wherein the biological
sequences are protein sequences.

6. The method of claim 1, wherein a first value of
said degree of match is indicative of an exact match, and a
second value of said degree of match is indicative of a
non-exact match.

7. A method of comparing two specimens containing
gene transcripts, said method comprising:

(a) analyzing a first specimen according to the
method of claim 1;

(b) producing a second library of biological
sequences;

(c) generating a second set of transcript sequences,
where each of the transcript sequences in said second set
is indicative of a different one of the biological
sequences of the second library;

(d) processing the second set of transcript sequences
in said programmed computer to generate a second set of
identified sequence values known as further identified
sequence values, where each of the further identified
sequence values is indicative of a sequence annotation and
a degree of match between one of the biological sequences
of the second library and at least one of the reference
sequences;

(e) processing each said further identified sequence
value to generate further final data values indicative of a
number of times each further identified sequence value is
present in the second library; and

(f) processing the final data values from the first
specimen and the further identified sequence values from
the second specimen to generate ratios of transcript
sequences, each of said ratio values indicative of
differences in numbers of gene transcripts between the two
specimens.
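
The ratio step (f) can be illustrated with a short Python sketch. The claim does not say how a sequence absent from one specimen is handled, so the treatment of zero counts below (infinity, or NaN when absent from both) is an assumption, as are the function and parameter names.

```python
import math


def abundance_ratios(first_counts, second_counts):
    """Ratio of first-specimen to second-specimen counts per sequence."""
    ratios = {}
    for key in sorted(set(first_counts) | set(second_counts)):
        a = first_counts.get(key, 0)
        b = second_counts.get(key, 0)
        if b:
            ratios[key] = a / b
        else:
            # Absent from the second specimen: treat the ratio as unbounded,
            # or as undefined when the count is zero in both specimens.
            ratios[key] = math.inf if a else math.nan
    return ratios
```

Raw counts are divided directly here; if the two libraries differ in size, dividing each count by its library total first would be a natural variation.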

8. A method of quantifying relative abundance of mRNA
in a biological specimen, said method comprising the steps
of:

(a) isolating a population of mRNA transcripts from
the biological specimen;

(b) identifying genes from which the mRNA was
transcribed by a sequence-specific method;

(c) determining numbers of mRNA transcripts
corresponding to each of the genes; and

(d) using the mRNA transcript numbers to determine
the relative abundance of mRNA transcripts within the
population of mRNA transcripts.
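
Steps (c) and (d) of claim 8 reduce to dividing each gene's transcript count by the total. The following Python sketch shows that arithmetic under the assumption that the counts are already available as a gene-to-count mapping; the function name is hypothetical.

```python
def relative_abundance(transcript_counts):
    """Convert per-gene transcript counts into fractions of the population."""
    total = sum(transcript_counts.values())
    if total == 0:
        # No transcripts counted: no abundances to report.
        return {}
    return {gene: n / total for gene, n in transcript_counts.items()}
```

For example, counts of 3 and 1 for two genes give relative abundances of 0.75 and 0.25.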

9. A diagnostic method which comprises producing a
gene transcript image, said method comprising the steps of:

(a) isolating a population of mRNA transcripts from a
biological specimen;

(b) identifying genes from which the mRNA was
transcribed by a sequence-specific method;

(c) determining numbers of mRNA transcripts
corresponding to each of the genes; and

(d) using the mRNA transcript numbers to determine
the relative abundance of mRNA transcripts within the
population of mRNA transcripts, where data determining the
relative abundance values of mRNA transcripts is the gene
transcript image of the biological specimen.

10. The method of claim 9, further comprising:

(e) providing a set of standard normal and diseased
gene transcript images; and

(f) comparing the gene transcript image of the
biological specimen with the gene transcript images of step
(e) to identify at least one of the standard gene
transcript images which most closely approximate the gene
transcript image of the biological specimen.
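
The comparison in step (f) needs some measure of closeness between transcript images, which the claim leaves open. The Python sketch below assumes, purely for illustration, that each image is a gene-to-relative-abundance mapping and that closeness is Euclidean distance; both the metric and the names are assumptions, not part of the claim.

```python
import math


def closest_reference(test_image, references):
    """Return the name of the reference image nearest to test_image.

    references: non-empty dict mapping a label to a transcript image.
    """
    def distance(a, b):
        # Genes missing from an image are treated as abundance 0.0.
        genes = set(a) | set(b)
        return math.sqrt(sum((a.get(g, 0.0) - b.get(g, 0.0)) ** 2 for g in genes))

    return min(references, key=lambda name: distance(test_image, references[name]))
```

Other measures (for example, rank correlation of the abundance lists) could be substituted in `distance` without changing the rest of the sketch.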

11. The method of claim 9, wherein the biological
specimen is biopsy tissue, sputum, blood or urine.

12. A method of producing a gene transcript image,
said method comprising the steps of:

(a) obtaining a mixture of mRNA;

(b) making cDNA copies of the mRNA;

(c) inserting the cDNA into a suitable vector and
using said vector to transfect suitable host strain cells
which are plated out and permitted to grow into clones,
each clone representing a unique mRNA;

(d) isolating a representative population of
recombinant clones;

(e) identifying amplified cDNAs from each clone in
the population by a sequence-specific method which
identifies gene from which the unique mRNA was transcribed;

(f) determining a number of times each gene is
represented within the population of clones as an
indication of relative abundance; and

(g) listing the genes and their relative abundance in
order of abundance, thereby producing the gene transcript
image.
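
Steps (f) and (g) of claim 12 can be sketched as a count followed by a sort. The Python example below assumes the input is simply the list of gene names identified for each clone; the function name and the alphabetical tie-break are illustrative choices, not taken from the claim.

```python
from collections import Counter


def transcript_image(clone_genes):
    """List (gene, count, fraction) tuples in descending order of abundance."""
    counts = Counter(clone_genes)
    total = sum(counts.values())
    # Sort by descending count; break ties alphabetically for a stable listing.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(gene, n, n / total) for gene, n in ranked]
```

An empty clone list yields an empty image, since no division is performed when there is nothing to rank.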

13. The method of claim 12, also including the step
of diagnosing disease by:

repeating steps (a) through (g) on biological
specimens from random sample of normal and diseased humans,
encompassing a variety of diseases, to produce reference
sets of normal and diseased gene transcript images;

obtaining a test specimen from a human, and producing
a test gene transcript image by performing steps (a)
through (g) on said test specimen;

comparing the test gene transcript image with the
reference sets of gene transcript images; and

identifying at least one of the reference gene
transcript images which most closely approximates the test
gene transcript image.

14. A computer system for analyzing a library of
biological sequences, said system including:

means for receiving a set of transcript sequences,
where each of the transcript sequences is indicative of a
different one of the biological sequences of the library;
and

means for processing the transcript sequences in the
computer system in which a database of reference transcript
sequences indicative of reference biological sequences is
stored, wherein the computer is programmed with software
for generating an identified sequence value for each of the
transcript sequences, where each said identified sequence
value is indicative of a sequence annotation and a degree
of match between a different one of the biological
sequences of the library and at least one of the reference
transcript sequences, and for processing each said
identified sequence value to generate final data values
indicative of a number of times each identified sequence
value is present in the library.

15. The system of claim 14, also including:

library generation means for producing the library of
biological sequences and generating said set of transcript
sequences from said library.

16. The system of claim 15, wherein the library
generation means includes:

means for obtaining a mixture of mRNA;

means for making cDNA copies of the mRNA;

means for inserting the cDNA copies into cells and
permitting the cells to grow into clones;

means for isolating a representative population of the
clones and producing therefrom the library of biological
sequences.